ABSTRACT


A  design  study  has  been  carried  out  for  a  general- 
purpose  signal  processing  computer  which  incorporates 
arithmetic  parallelism  in  a  microprocessor  structure. 
The  study  indicates  that  the  processor  (Advanced  Signal 
Processor,  ASP)  would  be  faster,  smaller,  simpler,  and 
less costly than its predecessor, the Fast Digital Processor
(FDP). In addition, the ASP would have a more
sophisticated in-out system than the FDP. These gains
are  achievable  partially  because  of  newly  available  fast 
hardware  and  partially  due  to  the  architecture  of  the  ASP. 
DESIGN STUDY OF THE ADVANCED SIGNAL PROCESSOR

I.  INTRODUCTION 

Applications such as radar, speech analysis and synthesis, and sonar
require a great number of signal processing operations to implement a
system. The advantages of carrying out these operations digitally in real
time have become well established in recent years. This design study
describes the Advanced Signal Processor (ASP), a fast programmable signal
processor that can be integrated into a real-time system for these and
other applications. It is emphasized that this report is a design study and
does not describe a machine that has been built. Plans for construction of
the ASP are indefinite at this time.

Speed is the prerequisite in a real-time system. The key features of
the  ASP  are  speed,  programmability,  communications,  and  compactness. 

The ASP will be slightly faster than the Fast Digital Processor (FDP),* a
Laboratory  computer  that  has  speed  enough  for  real-time  radar  and  speech 
applications.  Like  the  FDP,  the  ASP  is  a  general-purpose  processor  so 
that  rapid  spectral  analysis  and  other  signal  processing  functions  such  as 
windowing,  magnitude  taking,  and  thresholding  can  be  implemented  by 
programming. 

The ASP will differ from the FDP in communications capability and
size. The FDP was designed as part of a Laboratory computing facility; the
ASP has been designed to serve as part of a real (though perhaps
experimental) system. The FDP was given only minimum (input-output)
communications capability; complicated communications are handled by the
nearby Univac 1219 computer. Experience gained in integrating the FDP into
a real radar system has indicated that a more sophisticated in-out system
would have been quite desirable. Such a system will be incorporated into
the ASP to facilitate communications with external memories, other
computers, and various other devices. Also, the FDP is large and immobile,
but the ASP will fit into a medium-sized airplane while retaining and
actually surpassing the FDP's processing power.
As orientation to a description of the ASP, this design study begins by
noting the relationship of the ASP to the FDP and the LX-1 microprocessor,²
two general-purpose processors built at the Laboratory. Then the main
instruction classes and their execution times are described. (Detailed
descriptions of the instructions appear in the Appendix.) The programming
features of the ASP, which make it attractive for signal processing
applications, are illustrated by examples. Important hardware features of
the ASP are also described.

II.  ASP  STRUCTURE 

The ASP's structure was motivated largely by a consideration of the
assets and liabilities of the FDP and the LX-1.

A. Features of the FDP and LX-1

The FDP was designed as a general-purpose processor which could perform
signal processing operations such as spectral analysis and digital
filtering about 100 times faster than with standard computers. The FDP,
whose structure is shown in Fig. 1, derives its speed from three basic
factors: arithmetic parallelism, instruction cycle overlap, and fast
hardware. The four arithmetic elements (AEs) can operate in parallel
under independent control. Each contains an 18 x 18 array multiplier in
addition to adder and logic function hardware. The program memory is
separate from the data memories M_a and M_b, and a three-level overlap of
instructions is carried out. While a typical instruction is being executed,
the next instruction is being decoded, and a third instruction is being
fetched. The FDP was built from Motorola MECL II integrated circuits, the
fastest logic line available at the time of design. The speed of the FDP is
such that Doppler processing for 2048 range gates, including a 64-point
fast Fourier transform (FFT) and various other operations for each range
gate, could be performed in about two seconds. These operations are being
carried out by the FDP in a real-time demonstration radar system.

Fig. 1. Structure of the Fast Digital Processor.

This speed is the key asset of the FDP and ought to be retained, and if
possible, augmented in a new processor. However, an important liability
of the FDP is its great complexity and associated large size and cost. The
physical construction of the FDP was designed for engineering accessibility
rather than small size, but even with repackaging the FDP would remain too
large for, say, an airborne radar application. A desirable goal is to retain
or augment the FDP's speed in a significantly smaller and less complex
machine. Three important aspects of the FDP, which contribute to its large
size, have been modified in the ASP. First, the basic word length of 18 bits
was found to be more than necessary for the demonstration radar and
similar applications. The ASP will use a basic 12-bit word length but will
allow fast 24-bit operations when desired. Second, the number of arithmetic
units in the ASP is cut down to allow only two-fold arithmetic parallelism.
Third, the FDP has very complex control (for example, all AEs are
controlled independently) and a large number of specialized data paths such
as those between AEs and those between the various special registers
internal to each AE.

The in-out capability of the FDP was made quite limited since it was
expected that the UNIVAC 1219 would handle much of the required I-O.
Experience with implementation of an actual range-gated Doppler radar has
indicated that a slightly more sophisticated I-O system would be desirable,
and such a system will be included in the ASP.

The LX-1 microprocessor is a general-purpose computer whose present
chief application is to display processing; it was not designed especially
for signal processing. However, the LX-1 is an inherently simple and small
machine yet has some features which are quite attractive for a signal
processing computer. In Fig. 2 the LX-1 is shown to contain a set of general
registers R_i, a set of function boxes F_i, a data memory M_s, and three
busses A, B, and D which interconnect these parts. The basic data word
is 16 bits long. The control resides in the program memory M_p. In a
typical function instruction, two general registers are read onto
the A and B busses, an operation such as multiplication is performed in one
of the function boxes, and the result is written from the D-bus into another
general register. For a memory instruction, M_s is addressed from
the contents of a register placed on the B-bus, and reads from the A-bus or
writes onto the D-bus. Machine control is quite simple, since all
instructions cause data to flow through the busses in a similar way and
there are no specialized paths between certain special registers. Also,
programming of the LX-1 is quite simple because it is a serial machine.

Fig. 2. LX-1 microprocessor.
A key feature of the LX-1, which differentiates it from the FDP as
well as from standard computers, is the set of general registers. These
general registers have great flexibility, being useful, for example, as index
registers or arithmetic accumulators. The fact that all general registers
are accessible in the same way to the busses and the function boxes serves
to minimize the data shuffling necessary during a computation. For
example, in an FFT butterfly programmed on the FDP a number of
instructions must be devoted to shuffling data between the various
specialized I, Q, and R registers.

The LX-1, however, is significantly slower than the FDP in signal
processing applications. The LX-1 has fast hardware, but since it lacks
parallelism it is only about one-fourth as fast as the FDP for an FFT
butterfly.

B.  ASP  Architecture 


The ASP architecture represents a synthesis of some of the
speed-producing parallelism of the FDP into an LX-1 type structure featuring
general registers, simplicity of architecture and control leading to a small
size potential, and simplicity of programming. A new line of hardware,
faster than was available for the earlier machines, will be used.

The structure of the ASP, depicted in Fig. 3, features, like the LX-1,
a set of general registers M_r, function boxes, a data memory M_s, a bussing
structure, and a program memory M_p. However, several key departures
from the LX-1 are to be noted. The busses carry 24-bit words that may be
separated into two 12-bit bytes. The function boxes have dual sets of 12-bit
arithmetic hardware so that, for example, in the adder function box, two
simultaneous 12-bit adds or one 24-bit add can be carried out as a single
instruction. With the configuration box, which allows swapping of the two
12-bit bytes of a word, and an inhibition option, which allows nullification
of either of the two dual operations, completely flexible manipulation of
the bytes is possible.


The ASP will have 64 24-bit general registers (compared to 16 for the
LX-1). This large number of general registers provides a very high-speed
temporary storage, which, as will be illustrated later, can be used to speed
up signal processing programs. The large number of general registers is
feasible because of the availability of very fast integrated circuit memories
which permit the general registers to be realized as a memory, rather than
as a set of separate flip-flop registers, with negligible loss in speed. Since
two operands must be read from M_r in each instruction, the two physical
memories, M_r and M_r', will contain identical contents and be read
simultaneously.

Fig. 3. Structure of the Advanced Signal Processor.

The program and data memories are each 1024 x 24 integrated circuit
memories. Some overlap between reading of M_r and execution of
instructions will be incorporated.
The function boxes of the ASP will include: (1) an adder-logic complex
capable of performing two 12-bit or one 24-bit add, subtract, or logic
operation per instruction; (2) a multiplier box containing two 12 x 12 array
multipliers; (3) a special function box to facilitate shift and normalization
operations; and (4) an array divider. Like the LX-1, the ASP has a highly
modular structure so that different versions of the machine could include
new function boxes or leave out some of those just listed.

The in-out system of the ASP will include a pair of 24-bit direct memory
access channels, each of which can provide data flow to and from M_s in
parallel with the main program. There will be six additional auxiliary
channels to allow control signals (but not data) to be transmitted to, and
received from, other devices.

III.  INSTRUCTION  REPERTOIRE 

The  main  instruction  classes  available  in  the  ASP  are  now  introduced. 
Instruction  word  formats  will  be  presented,  and  some  examples  of  particular 
instructions  and  their  execution  times  will  be  given.  A  detailed  listing  with 
definitions  of  the  instruction  set,  as  it  currently  stands,  is  provided  in  the 
appendix. 

A. Arithmetic and Logical Operations

In this class are included all the instructions which are executed
in the adder-logic function box. All arithmetic in the ASP is 2's
complement. Three different instruction formats are utilized to control the
adder-logic functions, as indicated in Fig. 4.

Fig. 4. Three instruction formats to control adder-logic functions:
(a) 3-field format, (b) 2-field format, (c) 1-field format.

In the 3-field instructions, A selects one of 64 operands from M_r
for the A bus, B selects one of 64 operands for the B bus, and D
selects one of 64 destinations in M_r for the result. All operands are
24 bits, but the options of configuration and inhibition allow flexible
operation on 12-bit bytes. For example, a typical instruction can perform
the operations

$$A_u + B_l \rightarrow D_u; \qquad A_l + B_u \rightarrow D_l$$

where the subscripts u and l refer to upper and lower 12-bit bytes,
respectively. Such an instruction, consisting of two 12-bit adds, can be
performed in 65 nsec. In addition to various manipulations of the bytes,
some scaling provision is included.

The pair of operations

$$\tfrac{1}{2}(A_u + B_u) \rightarrow D_u; \qquad \tfrac{1}{2}(A_l + B_l) \rightarrow D_l$$

can be executed in 65 nsec. The option for 24-bit arithmetic is included,
and the 24-bit add

$$A + B \rightarrow D$$

can be executed in 75 nsec. Also included among the 3-field format
instructions are the bit-by-bit logic operations AND, XOR, and IOR, each of
which takes 65 nsec. For comparison, the FDP cycle time is 150 nsec for all
instructions except the multiply, which takes 450 nsec.
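The dual 12-bit and single 24-bit adds described above can be sketched in Python. This is only an illustrative model, not the ASP logic design: the function names, the wrap-around overflow behavior, and the truncating halving step are assumptions, and the inhibition option is not modeled.

```python
MASK12 = 0xFFF
MASK24 = 0xFFFFFF

def split(word):
    """Return the (upper, lower) 12-bit bytes of a 24-bit word."""
    return (word >> 12) & MASK12, word & MASK12

def join(upper, lower):
    """Pack two 12-bit bytes into one 24-bit word."""
    return ((upper & MASK12) << 12) | (lower & MASK12)

def to_signed12(value):
    """Interpret a 12-bit pattern as a 2's complement integer."""
    return (value ^ 0x800) - 0x800

def dual_add(a, b, swap_b=False, halve=False):
    """Two independent 12-bit adds; no carry crosses the byte boundary.

    swap_b models the configuration option (B bytes exchanged), and
    halve models the (1/2)(A + B) scaling option (assumed to truncate).
    """
    a_u, a_l = split(a)
    b_u, b_l = split(b)
    if swap_b:
        b_u, b_l = b_l, b_u
    results = []
    for x, y in ((a_u, b_u), (a_l, b_l)):
        total = to_signed12(x) + to_signed12(y)
        if halve:
            total >>= 1  # arithmetic shift right by one place
        results.append(total & MASK12)  # assumed wrap-around on overflow
    return join(results[0], results[1])

def add24(a, b):
    """Single 24-bit add; the carry propagates through all 24 bits."""
    return (a + b) & MASK24
```

With A = (5, -1) and B = (3, 1) as (upper, lower) bytes, the dual add gives (8, 0), while the 24-bit add of the same bit patterns gives (9, 0) because the carry out of the lower byte reaches the upper byte.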

In the 2-field format, which is included to allow more option codes
than would be otherwise possible, the A-field is ostensibly missing, but the
A operand is taken as the same general register as the D destination. This
group consists of various other adder-subtractor options which are
differentiated according to configuration, inhibition, scaling, and single
or double precision. An instruction type of interest is the sign extended
add, which permits, for example, a 12-bit byte of B to be sign extended to
24 bits and added to the 24-bit A operand. Another noteworthy instruction is
the zero inject instruction, which shifts B_l right one place and
unconditionally forces a zero into the sign bit. This instruction is quite
useful in programming a 24 x 24 bit multiply. Finally, a bit reversed add
instruction, similar to that in the FDP, is included.

The 1-field format is used for operations with 12-bit constants y
which comprise part of the instruction word. The constant can be inserted
in or added to either half of the register addressed by the D-field.

B.  Multiplication 

The  two  array  multipliers  in  the  multiplier  function  box  each  perform 
signed 12 x 12 bit multiplies yielding 24 bits of product. All multiply
instructions are executed in 120 nsec. Various options are provided as to the
configuration  of  the  input  bytes  and  the  possible  inhibition  of  writing  either 
of  the  two  multiplier  outputs.  Also  options  are  provided  to  select  those  bits 
of  the  24-bit  products  that  are  to  be  transmitted  to  the  two  12-bit  output  bytes. 

C.  Division 

The divide box contains an array divider and is capable of dividing a
24-bit dividend by a 12-bit divisor and providing a 12-bit quotient in
220 nsec.

D.  Scaling 

The scaling functions are designed to be used in conjunction with the
multiplier to yield efficient programming of normalization and shifting. For
example, to left justify a number, one would use the scale function (SF)
instruction to determine the necessary number of places to shift, the scale
factor positive (SFACP) instruction to set up a multiplier to effect the
shift, and a multiply instruction to actually carry out the shift. The entire
normalization would take 65 + 65 + 120 = 250 nsec and could be used in
floating point operations as well as in block normalization. The scaling
operations included are quite simple and require much less hardware than
that needed in a complete shifting matrix. The fast multiply permits shifting
to be accomplished quite quickly without such a matrix.
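The three-step normalization can be sketched as follows. The instruction names SF and SFACP come from the text, but the Python functions below are hypothetical models of them: the handling of zero, the use of plain Python integers, and the way the power-of-two factor is represented are assumptions.

```python
def scale_function(x):
    """Model of SF: number of left shifts that left-justifies a signed
    12-bit value without overflow (zero is assumed to need no shift)."""
    if x == 0:
        return 0
    places = 0
    while -2048 <= (x << (places + 1)) <= 2047:
        places += 1
    return places

def scale_factor_positive(places):
    """Model of SFACP: the multiplier operand that produces a left shift."""
    return 1 << places

def normalize(x):
    """Left-justify x by multiplying rather than using a shift matrix."""
    places = scale_function(x)               # SF, 65 nsec
    factor = scale_factor_positive(places)   # SFACP, 65 nsec
    return x * factor, places                # multiply, 120 nsec

# Instruction times quoted in the text, in nsec.
NORMALIZE_NSEC = 65 + 65 + 120
```

For example, `normalize(3)` yields the left-justified value 1536 with a shift count of 9, the count being what a floating point routine would subtract from the exponent.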

E.  Memory 

The memory reference instructions have the 2-field format of Fig. 4b.
The B-field points to the M_r location, which contains the M_s address of
interest, and the D-field points to the source for writing or the data
destination for reading. The various memory instruction options permit the
M_s address to come from B_u or B_l and the data source or destination to
be either D_u or D_l for a 12-bit transfer, or D for a 24-bit transfer. The
time for a memory read instruction, which includes the required accesses to
M_r and M_s, is 100 nsec. The time for a memory write instruction is
80 nsec.

F.  Branching 

The branching instructions in the ASP include arithmetic jumps,
overflow jumps, unconditional jumps, a jump conditional on in-out activity,
and a skip make instruction.

Arithmetic jumps are conditional on the contents of selected
registers. For example, one may jump on the condition that the upper byte of
some register is positive, or on the condition that the full 24-bit word in
a specified location is zero. The arithmetic jumps may be used with or
without a skip on jump (SOJ) like that in the FDP. If the SOJ bit in the jump
instruction word is not set, the instruction after the jump will be executed
even if the jump condition is met. If the SOJ bit is set, the instruction after
the jump will be nullified if the jump condition is met. The "SOJ not set"
option would save time in a tight loop since one instruction cycle time is
effectively lost in killing the next instruction. The format for arithmetic
jumps is indicated in Fig. 5. D selects the register to be tested and B
selects upper or lower byte; y selects the M_p location to be jumped to; and
a specifies the SOJ option.

Fig. 5. Format for arithmetic jumps.
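The effect of the SOJ bit on the instruction following a jump can be illustrated with a toy trace. The program encoding, the `trace` function, and the flag dictionary are hypothetical conveniences, not the ASP instruction format; only the behavior of the instruction after a taken jump follows the description above.

```python
def trace(program, conditions):
    """Return (names of executed operations, instruction slots nullified).

    Each instruction is ("op", name) or ("jump", condition, target, soj).
    Only forward jumps are used here, so the loop always terminates.
    """
    pc, executed, nullified = 0, [], 0
    while pc < len(program):
        instruction = program[pc]
        if instruction[0] == "op":
            executed.append(instruction[1])
            pc += 1
            continue
        _, condition, target, soj = instruction
        if not conditions[condition]:
            pc += 1  # jump not taken: fall through normally
            continue
        follower = pc + 1
        if follower < len(program):
            if soj:
                nullified += 1  # SOJ set: the next instruction is killed
            elif program[follower][0] == "op":
                # SOJ not set: the next instruction still executes
                executed.append(program[follower][1])
        pc = target
    return executed, nullified

def example_program(soj):
    """A five-instruction program with one conditional forward jump."""
    return [("op", "a"), ("jump", "c", 4, soj),
            ("op", "b"), ("op", "c"), ("op", "d")]
```

With the jump condition met, the SOJ-not-set program executes a, b, d with no lost slot, while the SOJ-set program executes a, d and loses one instruction slot.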

The overflow jumps are similar to the arithmetic jumps except that
the format is different (since a D-field is not needed) and the jump conditions
are the various types of overflow that can result from arithmetic operations.
The in-out activity jump tests various activity conditions on an in-out
channel. The skip make instruction is patterned after that in the FDP and
allows skipping of any combination of the next four instructions according
to the condition in one of the 16 flags in the ASP.
G. Input-Output and Block Transfer

The I-O system of the ASP includes two 24-bit direct memory access
data channels and six control channels, each of which can be monitored on an
interrupt basis.

Each data channel provides 24 input data lines, 24 output lines, and
several control lines to carry request signals and mode information. Data
transfers are initiated by a DMA instruction, which transmits to the I-O
hardware such parameters as block size and starting address. The I-O
channel hardware then carries out all operations needed to effect the
transfer, slowing down the main program only when both require access to
M_s at the same time. When the buffer is complete, a monitor interrupt (if
desired) then causes the main program to jump to a service routine whose
location was also specified in the DMA instruction.

The  six  control  channels  are  identical  to  the  data  channels  except  that 
the  data  lines  are  omitted.  The  control  signals  could  synchronize  the  ASP 
with  other  computers  in  a  real-time  system. 

The block transfer instruction (BLK) transfers a list of words from M_s
into M_p and is quite similar to the corresponding instruction in the FDP.
Like DMA, BLK must specify a block size and starting addresses. But unlike
DMA, the BLK causes all other operations to cease during its execution.

IV.  PROGRAMMING  FEATURES 

Some examples of the ASP's programming features:

A.  Double  Precision  and  Floating  Point 

The ASP was designed so that 24-bit, fixed point arithmetic could be
performed quite efficiently. A 24-bit add or subtract is performed in a
single 75-nsec instruction, and a 24-bit memory access is accomplished
with one memory instruction. Of course the machine's dual parallelism is
lost for 24-bit operations. A 24 x 24 bit multiply must be programmed.
Using the formula

$$AB \approx A_u B_u + 2^{-11}\,(A_u B_l' + B_u A_l'),$$

where A_l' or B_l' is formed by shifting the lower byte of A or B right one
bit and forcing the sign bit to zero (the ZINJ instruction), a result accurate
to 22 bits can be obtained in six instructions or 520 nsec.
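A short numerical check of the double precision multiply formula is sketched below. It assumes operands are signed 24-bit fractions held as integers scaled by 2^23; the function names are illustrative, and the result is returned as a Python float rather than as register contents.

```python
def split_with_zero_inject(x):
    """Split a signed 24-bit integer into its signed upper byte and its
    lower byte after the zero inject step (shift right one, sign = 0)."""
    upper = x >> 12              # arithmetic shift keeps the sign
    lower = x & 0xFFF            # unsigned lower 12 bits
    return upper, lower >> 1

def multiply24_approx(a, b):
    """Approximate (a / 2**23) * (b / 2**23) using only 12 x 12 products."""
    a_u, a_l = split_with_zero_inject(a)
    b_u, b_l = split_with_zero_inject(b)
    main = a_u * b_u                   # A_u B_u
    cross = a_u * b_l + b_u * a_l      # A_u B_l' + B_u A_l'
    return (main + cross / 2**11) / 2**22

def multiply24_exact(a, b):
    """Reference product of the two fractions."""
    return (a / 2**23) * (b / 2**23)
```

The terms dropped by the formula (the product of the two lower bytes and the bit lost in each zero inject) bound the error below 2^-21 in this model, which is in line with the 22-bit accuracy quoted above.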

There are no hardware floating point instructions on the ASP, and
floating point arithmetic must be programmed. However, the scaling
functions mentioned above facilitate the shifting and normalizations needed
for floating point. A single precision (12-bit fraction, 12-bit exponent)
floating point multiply can be executed in about 0.7 µsec, while a double
precision (24-bit fraction, 12-bit exponent) multiply takes about 1.1 µsec.
Single precision floating add takes about 1.6 µsec while double precision
requires 2.7 µsec. These times seem slow in comparison to fixed point
operations, but compare favorably with other computers. For example, the
IBM 360 Model 67, which has hardware floating point, takes about 5 µsec for
a multiply and 2.5 µsec for an add. Standard computers without floating
point hardware take significantly longer.

B.  FFT  Butterfly 

The basic computation in an FFT is the so-called butterfly computation
which, as indicated in Fig. 6, operates on two complex numbers to yield two
new complex numbers and requires a complex multiply and two complex adds.

$$a + jb,\; c + jd \;\longrightarrow\; \begin{cases} a + (c\cos\theta - d\sin\theta) + j\,[\,b + (d\cos\theta + c\sin\theta)\,] \\ a - (c\cos\theta - d\sin\theta) + j\,[\,b - (d\cos\theta + c\sin\theta)\,] \end{cases}$$

Fig. 6. Butterfly computation: FDP, 10 instructions, 1.5 µsec;
LX-1, 30 instructions, 4.5 µsec; ASP, 12 instructions, 1.0 µsec.

A standard N-point radix 2 FFT requires (N/2) log_2 N butterfly
computations. A butterfly can be programmed on the ASP with 12
instructions, including 4 memory accesses, 3 add instructions, 2 multiply
instructions, and 3 index and branch instructions. The execution time is
about 1.0 µsec. For comparison, a butterfly on the FDP, as programmed for
the demonstration radar, takes 10 instructions or 1.5 µsec.

Thus the ASP will be somewhat faster than the FDP for standard FFT
programs. This speed-up comes chiefly from the faster instruction
execution resulting from the faster hardware, and the fact that 12-bit
operations take less time than 18-bit operations.
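The butterfly of Fig. 6 and a rough timing tally can be written out as below. The floating point arithmetic stands in for the ASP's 12-bit fixed point, and the tally rests on two assumptions not stated in the text: that the 4 memory accesses are 2 reads and 2 writes, and that index and branch instructions take 65 nsec each.

```python
import math

def butterfly(x, y, theta):
    """Butterfly on x = a + jb and y = c + jd with coefficient e^{j*theta}.

    Returns (x + y * e^{j*theta}, x - y * e^{j*theta}).
    """
    a, b, c, d = x.real, x.imag, y.real, y.imag
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    real_part = c * cos_t - d * sin_t   # real part of the rotated input
    imag_part = d * cos_t + c * sin_t   # imaginary part of the rotated input
    return (complex(a + real_part, b + imag_part),
            complex(a - real_part, b - imag_part))

# Instruction times in nsec from the text; the read/write split and the
# 65-nsec index and branch time are assumptions made for this estimate.
ASP_BUTTERFLY_NSEC = (2 * 100    # memory reads
                      + 2 * 80   # memory writes
                      + 3 * 65   # dual 12-bit adds
                      + 2 * 120  # dual 12 x 12 multiplies
                      + 3 * 65)  # index and branch instructions
```

Under these assumptions the tally comes to 990 nsec, consistent with the figure of about 1.0 µsec quoted above.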


C.  Radix  8  FFT 


The foregoing discussion indicated the speed of the ASP in carrying
out a standard radix 2 FFT. By means of slightly more sophisticated FFT
programming, advantage can be taken of the large number of fast general
registers to achieve significant speed-up.

The technique can be illustrated by the example of a 64-point FFT,
programmed in radix 8. The input data in M_s is thought of as organized in
a two-dimensional array as depicted in Fig. 7. The FFT is begun by bringing
the first row into M_r and computing an 8-point FFT of this row without
additional access to M_s. The 8-point FFT is implemented as efficiently as
possible; for example, when the coefficient e^{jθ} in the butterfly is 1 or
j (more than half the cases), no multiplications are executed. Each of the
eight outputs is multiplied by a complex twiddle factor, and the results are
stored back in place in M_s. This procedure is repeated for all the rows.
Then each column is brought into M_r, transformed (no additional twiddle
factors are necessary), and stored back in M_s. This completes the
64-point FFT.

$$\begin{matrix} f_0 & f_8 & \cdots & f_{56} \\ f_1 & f_9 & \cdots & f_{57} \\ \vdots & \vdots & & \vdots \\ f_7 & f_{15} & \cdots & f_{63} \end{matrix}$$

Fig. 7. 64-point FFT, radix 8. Computational steps: (1) eight 8-point
discrete Fourier transforms (DFT) on rows, (2) twiddle factors (64 complex
multiplies), (3) eight 8-point DFTs on columns.

In this implementation of a 64-point FFT, only two exchanges of the
array between data memory and the general registers are necessary. This
saves two-thirds of the memory access time of a radix 2 algorithm, which
requires log_2 64 = 6 such exchanges. Also the 8-point FFTs may be coded
more efficiently by eliminating unnecessary multiplications. The result is
that, with radix 8, a 64-point FFT can be computed in about 60% of the time
necessary for a radix 2 program. This saving is possible only because there
are enough general registers to provide all the necessary storage for an
8-point FFT.
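The row, twiddle, column procedure can be checked numerically with a short sketch. It assumes the forward transform convention e^{-j2πkm/N}, uses a direct DFT in place of the hand-optimized 8-point FFT, and the names `dft` and `two_pass_dft` are illustrative only.

```python
import cmath

def dft(x):
    """Direct DFT, standing in for the optimized short FFT on one row."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * k * m / n)
                for k in range(n))
            for m in range(n)]

def two_pass_dft(x, n_rows, n_cols):
    """N-point DFT, N = n_rows * n_cols, by row and column transforms.

    Array element (r, c) is x[r + n_rows * c], so for an 8 x 8 array row r
    holds f_r, f_{r+8}, ..., f_{r+56}. Output m = q + n_cols * p comes from
    column q, position p.
    """
    n = n_rows * n_cols
    rows = []
    for r in range(n_rows):
        # Pass 1: transform one row, then apply the twiddle factors.
        row = dft([x[r + n_rows * c] for c in range(n_cols)])
        rows.append([row[q] * cmath.exp(-2j * cmath.pi * r * q / n)
                     for q in range(n_cols)])
    out = [0j] * n
    for q in range(n_cols):
        # Pass 2: transform one column; no further twiddles are needed.
        column = dft([rows[r][q] for r in range(n_rows)])
        for p in range(n_rows):
            out[q + n_cols * p] = column[p]
    return out
```

Calling `two_pass_dft(x, 8, 8)` on 64 samples touches each array element in only two passes, which corresponds to the two exchanges between M_s and M_r counted above.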

This technique can be extended to FFTs of other sizes, and
implementation with other radixes. Also the general technique of using M_r
as high-speed temporary storage can speed up a wide variety of programs.

D.  Large  FFT  with  External  Core  Storage 

The high-speed data memory of the ASP will be initially limited to 1024
words because of size and cost considerations. However, it is often desired
to perform an FFT where the number of samples is too large to be
accommodated in M_s, and it would be advantageous if such a transform could
be carried out with only small speed loss caused by shuffling the data in
and out of an external core memory. The direct memory access capability of
the ASP makes this possible. The technique will now be illustrated by a
2048-point FFT example.

Consider the data (stored sequentially in core) as a two-dimensional,
32 x 64 array where the rows consist of samples spaced by 32 sampling
intervals and the columns contain sequential samples. A 64-point FFT on each
row is computed, and the results are multiplied element-by-element by a set
of complex constants (called twiddle factors, which are also stored in
core) and stored back in core. Then 32-point FFTs on each column are
performed and the computation is complete. The transform will be ordered
in core with rows and columns interchanged.

While the processor is computing the FFT of a row, the next row of
data and twiddle factors is flowing into data memory and the last computed
row is being sent to core by means of direct memory access block transfers
which are controlled by in-out hardware and slow down the FFT computation
only negligibly. Associated with the core memory must be an address box,
which, after initiation from the processor, can sequence through an
arbitrary number of core locations with an arbitrary spacing between
locations.
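The address sequences such a box would generate for the 2048-point example can be sketched as follows. The function names are hypothetical, results are assumed to be stored back in place, and the output index mapping is an inference from the 32 x 64 row and column procedure described above rather than a statement from the design.

```python
N_ROWS, N_COLS = 32, 64   # 2048 samples viewed as a 32 x 64 array

def address_sequence(start, spacing, count):
    """Core addresses visited after one initiation of the address box."""
    return [start + spacing * i for i in range(count)]

def row_addresses(r):
    """Row r: 64 samples spaced by 32 sampling intervals."""
    return address_sequence(r, N_ROWS, N_COLS)

def column_addresses(c):
    """Column c: 32 sequential samples."""
    return address_sequence(N_ROWS * c, 1, N_ROWS)

def transform_index_at(address):
    """Index of the transform point left at a core address when every
    row and column result is written back in place."""
    q, p = divmod(address, N_ROWS)   # column number, position in column
    return q + N_COLS * p
```

In this model core address 1 ends up holding transform point 64 and address 32 holds point 1, which is the interchange of rows and columns mentioned above.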

With  the  scheme  just  sketched  out,  the  2048-point  transform  can  be 
computed  essentially  as  fast  as  if  sufficient  fast  memory  were  available  to 
store  the  entire  array. 

V.  HARDWARE  FEATURES 

This section describes the processor's logical design and its method
of construction as now envisioned. Changes can be expected as the design
progresses.

The treatment begins by explaining why MECL 10K integrated circuit
logic units were selected for building the processor. The general registers,
function boxes, timing, input-output, remote console, and construction are
then described. The processor's control circuitry is not yet defined.

A detailed description of the processor's instructions is given in the
Appendix and a familiarity with them is assumed.

A.  MECL  10K 

The basic ground rules for choosing an integrated circuit logic line for
the processor were that using it we could produce a machine with an
instruction cycle time less than 100 nsec that could be packaged in a 6-ft
relay rack. The machine's speed and physical size were estimated by studying
designs of multipliers, general registers, and memories. Three logic lines
were considered: Schottky T²L, 1-nsec Emitter Coupled Logic (ECL), and
2-nsec ECL. Schottky T²L was eliminated because of speed; its gate delay is
3 nsec, which is 50 percent slower than the 2-nsec ECL. The 1-nsec ECL line,
Motorola MECL III, has a limited number of logic functions, must be
packaged on multilayer boards, and its gates dissipate almost twice the
power of 2-nsec ECL circuits. The 2-nsec ECL was selected.

/ 

Two 2-nsec ECL lines are commercially available at this time:
Motorola MECL 10K and Fairchild 9500. Both lines are equal in speed and
contain, or will contain, equivalent circuit functions. Tentatively, the new
processor will use Motorola MECL 10K because: (1) It requires less power,
30 vs 75 mW/gate when driving a 2-kΩ load. (2) Its output voltage levels
are compatible with those of the Advanced Memory Systems (AMS) memory
element over the temperature range 0° to 70°C. The processor's memories
and general registers are to be built from the AMS circuit. (3) A four-bit
arithmetic logic element, vital for array multipliers and dividers, is
currently available in quantity.

Some of the MECL 10K line's more important features are:

(1) 2-nsec propagation delay and 3-nsec rise and fall times

The 3-nsec rise and fall times are slow enough to permit
unterminated lines up to 4 in. long without worrying about
reflections. The 1.2-nsec rise and fall times of MECL III restrict
line lengths to less than 1 1/4 in.

(2) 50-Ω drive capability

For lines longer than 4 in., where reflections are a problem, the
lines can be terminated in 50 Ω or greater to negate reflections.

(3) Balanced twisted pair line interface

Signals can be transmitted and received over balanced twisted
pair, a good way to distribute the system clock because it is
easy to control the transmission delays by changing line lengths.

(4) Compatibility with MECL II and III

Compatibility with high speed MECL III and slow speed, 4-nsec
propagation delay MECL II provides the MECL 10K line added
flexibility.

A semiconductor memory is necessary to realize the proposed machine.
Unfortunately, neither Motorola nor Fairchild has an off-the-shelf
memory element whose speed is compatible with the rest of the circuits in
its line. Fortunately, Advanced Memory Systems (AMS) is
producing a 64-bit memory element that is compatible with MECL 10K,
though not compatible with Fairchild 9500. The element is organized as 64
words, 1 bit per word, and has a 7-nsec read time and a 7-nsec write time.

MECL  10K  was  selected  instead  of  Fairchild  9500  because  of  the  avail¬ 
ability  of  MECL  4-bit  arithmetic  logic  circuits  and  compatibility  with  the 
AMS  memory  circuits. 

B.  General  Registers 

To review how the processor's structure works (Fig. 8), assume that
a 12-bit add instruction has just been transferred into the instruction
register from the program memory. The 6-bit A address selects a 24-bit
general register whose output is transferred onto the A bus, and the 6-bit B
address selects a second 24-bit general register whose output is transferred
onto the B bus. The 12-bit add is executed in the adder function box, using
the A and B operands which are available at its inputs. The result is stored
in the general register specified by the 6-bit D address.

The  whole  operation  has  three  distinct  parts:  (1)  read  the  general 
registers,  (2)  execute  instruction  in  a  function  box,  and  (3)  write  result  back 
into  general  registers. 

It is apparent from this review that an instruction's speed is highly
dependent on how fast the general registers can be read and written. Even
if an add or multiply could be executed in zero time, a complete add or
multiply instruction would still require time to read and write the general
registers.

The  general  registers  can  be  realized  in  two  ways.  In  both  designs, 
the  A  and  B  operands  will  be  read  from  the  general  registers  simultaneously 
instead  of  serially  to  increase  the  speed  at  which  instructions  can  be 
executed. 

The logic needed to build the general registers from flip-flops (FF)
and gates is indicated in Fig. 8b. This design has two major problems
besides requiring 64 separate FF registers: (1) One bit of the bus is
obtained by multiplexing together 64 FF outputs, and this must be done for
each of the 24 A and 24 B bus bits; (2) Each of the 24 D bus lines must be
distributed to 64 loads.

This design would use over 2000 MECL 10K packages, an excessive
number for a small machine.

The general registers (Fig. 8c) for this processor contain two 64-word,
24 bits/word image memories, which word-for-word always contain
identical data. The A operand is read from the A image, and simultaneously,
the B operand is read from the B image. Both operands are stored in bus
registers before being sent to the function boxes.

Results are written into the two memories by storing them temporarily
in the D register. When the memories are not busy, e.g., when an add
or multiply is actually being performed in a function box, the contents of the
D register can be written into both memories. The write operation does not
affect the other function boxes because they are isolated from the memories
by the bus registers. This method of writing the memories permits "burying"
or hiding the time needed to write them, a minimum of 7 nsec of memory
element write time.

It  takes  150  integrated  circuits  that  include  48  64-bit  memory  elements 
to  build  these  general  registers,  which  is  considerably  less  costly  than  the 
previous  solution  (Fig.  8b). 

When the Next Instruction Pulse (NI Pulse) is generated by the
processor's control circuitry, which is not shown in Fig. 9, a new instruction
is transferred into the instruction register. Simultaneously, the contents of
the D bus, which are the result of the last instruction and may or may not
have meaning, are transferred into both the DA and DB registers. Also, the
6-bit D address portion of the instruction register, which specifies the address
at which the contents of the D bus will be stored in the image memories, is
transferred to the Store Address Register.

Data are now read from the A and B image memories by addressing
them with the new A and B addresses, which are in the instruction register.
In parallel, the two addresses are compared with the store address. If
either address is equal to the Store Address, and if the last instruction
produced a result that must be stored in the general register memory, a
one level is produced at the appropriate comparator output, indicating that
the needed operand is stored in the DA and DB registers and has, as yet,
not been written into the image memories. If neither comparator output is
a one, the outputs of the image memories are switched through the bus input
gating circuitry and transferred into the bus registers when the control
logic generates a bus pulse. When an operand address is equal to the store
address, the appropriate address comparator output will be a one and this
will switch the correct data storage register, DA or DB, through the bus input
gating logic and into the bus register when the bus pulse occurs.

Fig. 9. General register memory.
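
The read path with its comparators can be modelled behaviourally as below. This is a minimal sketch, not the report's logic design: the class and method names are hypothetical, only full 24-bit results are modelled, and a single `d_reg` field stands in for the DA and DB registers, which always hold the same value.

```python
class GeneralRegisters:
    """Toy model of the two-image general register memory with bypass."""

    def __init__(self, words=64):
        self.image_a = [0] * words   # A image memory
        self.image_b = [0] * words   # B image memory, kept identical to A
        self.d_reg = 0               # stands in for the DA and DB registers
        self.store_addr = 0          # Store Address Register
        self.pending = False         # a latched result still awaits writing

    def next_instruction(self, d_bus, d_addr, has_result=True):
        """NI pulse: latch the previous result and its destination address."""
        self.d_reg, self.store_addr, self.pending = d_bus, d_addr, has_result

    def read(self, a_addr, b_addr):
        """Return the values clocked into the A and B bus registers."""
        hit_a = self.pending and a_addr == self.store_addr  # comparator A
        hit_b = self.pending and b_addr == self.store_addr  # comparator B
        a = self.d_reg if hit_a else self.image_a[a_addr]
        b = self.d_reg if hit_b else self.image_b[b_addr]
        return a, b

    def write_back(self):
        """Buried write: copy the latched result into both image memories."""
        if self.pending:
            self.image_a[self.store_addr] = self.d_reg
            self.image_b[self.store_addr] = self.d_reg
            self.pending = False
```

In this model an instruction that reads the register just written by its predecessor receives the value from the D register rather than from the not-yet-updated image memory.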

The outputs of the bus registers go to all function boxes, and they are
transformed in the particular function box that is specified by the operation
code of the current instruction. In parallel, data in the DA and DB
registers are written into the image memories, which are not now involved
in the function box operations, at the location specified by the Store Address
Register. wp_l and wp_u are the memory write commands; wp_l initiates a
write operation in the lower 12 bits of a storage location, and wp_u initiates
a write operation in the upper 12 bits. If the last instruction produced a
24-bit result, both wp_l and wp_u will be enabled. If the last instruction
produced a 12-bit result, either wp_l or wp_u will be enabled; the choice
between wp_l and wp_u depends on whether the 12-bit result appears in the
lower or the upper half of the 24-bit word. If the result is in the lower 12
bits of the word, wp_l is enabled; if it is in the upper 12 bits, wp_u is enabled.

Read  time  is  defined  as  that  interval  which  begins  when  new  data  are 
transferred  into  the  instruction  register  and  which  ends  when  the  A  and  B 
operands  arrive  at  the  function  box  inputs.  Thus  read  time  for  the  image 
memory  realization  is  30  nsec. 


The component propagation delays for the logic in Fig. 9 that result
in a 30-nsec read time, as defined, are shown in Fig. 10, a simplified
general register timing diagram. The actual time to read the image
memories, 7 nsec, is less than 25 percent of the 30-nsec general register read
time.

Fig. 10. General register memory timing. The delays shown, from t = 0
to t = 30 nsec, are:

3 nsec: settling time for the instruction, DA, DB, and store address registers
5 nsec: distribute instruction register signals
5 nsec: address image memories
7 nsec: read image memories
5 nsec: input gating and settling time for A and B bus registers
5 nsec: A and B bus distribution

The diagram also shows a 15-nsec interval associated with writing the
image memories.


C.  ALU  Function  Box 

The arithmetic logic unit (ALU) function box contains logic to implement
the following types of instructions:

(1) Single-precision (12-bit) and double-precision (24-bit) additions and
subtractions

(2) Double-precision logical operations

(3)  Modification  of  general  register  by  constants 

(4)  Special  functions:  bit-reversed  add,  scale  function,  scale 
factor  positive,  scale  factor  negative,  zero  inject 

(5)  Branching. 

These  instructions  are  explained  in  detail  in  the  Appendix  and  in  Section  H. 

The logic needed to implement these instructions, except for the special
scale functions, can be realized with a versatile adder-subtractor-logic unit,
which has two nearly identical halves called 12-bit adders (Fig. 11). The
basic adder element is the MC 10181, the 4-bit ALU. Two 4-bit numbers
and a carry are entered into each ALU element, and four sum bits and a
carry out, as well as some important auxiliary functions, are produced. When
the A and B inputs to an ALU element change, it takes 7 nsec for the
element's sum outputs to stabilize and 5.4 nsec for its carry output to stabilize.
When an element's carry input changes while its A and B inputs are static,
it takes 5 nsec for the element's sum outputs to settle down and 3.1 nsec
for its carry output to settle down. Using the ALU, it takes 13.5 nsec to
add two 12-bit numbers.

Fig. 11. 12-bit adder.

Although  not  shown  in  Fig.  11,  each  ALU  has  five  control  inputs, 
which  choose  which  one  of  32  possible  functions  (16  logical,  16  arithmetic) 
the  ALU  will  perform.  Examples  of  the  logical  functions  are  the  logical 
product,  A  •  B,  the  logical  "Or",  A+B,  A  itself,  or  B  itself.  Examples  of 
the  arithmetic  functions,  besides  addition  and  subtraction,  are  the  function 
2A  and  A  plus  A  •  B.  All  of  these  functions  are  performed  in  a  time  equal 
to  or  less  than  an  add. 

Double-precision operations are performed in a 24-bit adder, built by
interconnecting two 12-bit adders. The carry input to the second 12-bit adder
is C12, the carry out of the first 12-bit adder. C12 is generated in 7.3 nsec
in the fast carry circuit (Fig. 12b) instead of being taken directly from the
carry out of the first 12-bit adder. The fast carry circuit uses the ALU
element's P_G and G_G functions, which are defined in Fig. 12a. It takes
5 and 3 nsec for G_G and P_G, respectively, to stabilize after an input
variable change. The complete 24-bit add requires 20.7 nsec: 7.3 nsec to
generate C12, 2.2 nsec to gate C12 into the carry input of the first stage of
the second adder, and 11.2 nsec for the second adder to produce its 12-bit
result.

Fig. 12. Carry look ahead.

Both adders have input and output gating that adds an additional 8.3
nsec to the time it takes for data to flow through the adder function box.
Thus, all 12-bit operations will take 21.8 nsec or less, and all 24-bit
operations will take 29.0 nsec or less.
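
The quoted adder times can be reproduced from the element delays. The sketch below is an inference rather than something the report spells out: it assumes the 13.5-nsec and 11.2-nsec figures are worst-case carry ripples through three 4-bit elements, and the constant and function names are hypothetical.

```python
# Delays quoted in the text for one 4-bit ALU element, in nsec.
AB_TO_CARRY = 5.4    # A/B inputs change -> carry output stable
CIN_TO_CARRY = 3.1   # carry input changes -> carry output stable
CIN_TO_SUM = 5.0     # carry input changes -> sum outputs stable
FAST_CARRY = 7.3     # look-ahead generation of C12
CARRY_GATE = 2.2     # gating C12 into the second 12-bit adder
IO_GATING = 8.3      # input and output gating of the function box


def ripple_add_time(elements=3, first_delay=AB_TO_CARRY):
    """Worst-case delay when a carry ripples through `elements` packages.

    Assumption: the first package contributes `first_delay`, each middle
    package a carry-in to carry-out delay, and the last package a
    carry-in to sum delay.
    """
    return first_delay + CIN_TO_CARRY * (elements - 2) + CIN_TO_SUM


ADD12 = ripple_add_time()                                   # first adder
SECOND_ADDER = ripple_add_time(first_delay=CIN_TO_CARRY)    # starts from C12
ADD24 = FAST_CARRY + CARRY_GATE + SECOND_ADDER

OP12 = ADD12 + IO_GATING   # bound on any 12-bit operation
OP24 = ADD24 + IO_GATING   # bound on any 24-bit operation
```

Under these assumptions the values come out to 13.5, 11.2, 20.7, 21.8, and 29.0 nsec, matching the figures in the text.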


D.  Timing 

A simplified timing diagram for two consecutive add instructions
(Fig. 13) will aid calculation of the time to execute 12-bit add instructions.

Fig. 13. 12-bit add instruction timing. For each of the two instructions
the diagram shows a 30-nsec read of MR, a 21.8-nsec execution of the 12-bit
add, a 5-nsec D-bus distribution, and the 15 percent safety factor, for a
total of 65 nsec, together with a 15-nsec write of MR and a 60-nsec read of
the program memory.


At t = 0 the first add instruction is transferred into the instruction register
from the program memory, and the program address register that had
previously contained the address P is incremented by 1 so that its new value
is P + 1. From the section on general registers, it takes 30 nsec to read
the two operands from the general registers and to transfer them to the ALU
function box. It then takes 21.8 nsec to add the operands and an extra 5 nsec
to send the result from the function box back to the general registers via the
D bus. The sum of these three times is 56.8 nsec. A 15 percent safety
factor is added to cover delay variations due to temperature changes, power
supply variations, and noise, giving a total of 65 nsec when rounded to the
nearest 5-nsec increment.

While the first add is executed, the next instruction (the second
addition) is read from address P + 1 of the program memory. A program
memory read, as explained in the next section, requires 60 nsec; this is the
interval beginning when a new address is clocked into the program memory
address register and ending when a new instruction arrives at the input of
the instruction register. The second add instruction is, therefore,
available at the input to the instruction register when the first add is complete,
and a new add instruction is begun by transferring the new instruction into
the instruction register. Simultaneously, the 12-bit result of the first add
is clocked into the DA and DB registers (Fig. 9), from which it will be
written into the general registers after operands for the current add
instruction are read from the general registers. It follows that the second add
instruction and all subsequent add instructions taken from consecutive
program memory locations are executed in 65 nsec.

When executing 24-bit additions, the timing diagram (Fig. 13) remains
unchanged except that the addition time changes from 21.8 to 29.0
nsec. There is a corresponding change in the 15 percent safety factor,
resulting in a 24-bit add instruction time of 75 nsec.

In general, the time required to complete any of the processor's
instructions has four components: (1) 30 nsec to read the general registers;
this time increment is included in all instructions, even those few for which
it is not required, such as JPS, an unconditional jump; (2) X nsec to perform
an operation in a function box; (3) 5 nsec to transmit a result from a
function box back to the general registers, if this is required by the
instruction; and (4) a 15 percent safety factor.

There are two exceptions to this rule: (1) when the computed instruction
time is less than 65 nsec, and (2) when certain program jumps are
performed.

Some instructions, such as 12- and 24-bit logic functions, require less
time than a 12-bit addition because no carry has to propagate through the
adder. Other instructions, such as JPS, require no more than 10 or 15 nsec
to execute. Unfortunately, this speed cannot be taken advantage of because
the read of the next instruction from the program memory takes 60 nsec,
which is only 5 nsec less than the 65 nsec needed for a 12-bit add. These
fast instructions could be shortened by 5 nsec, but there is
little gain in doing so.

When a jump instruction located at address P in the program memory
is executed, there is the option, under the control of the instruction's a bit
(explained in the Appendix), to skip or execute the instruction located at
address P + 1. This assumes, of course, that the jump instruction commands
that the program branch to a location Z not equal to P + 1. If the
program is going to skip the instruction following the jump, then the
processor cannot use the instruction which was read out of the program
memory while the jump was in progress and must wait for the new instruction
located at address Z to be read. This requires at least 60 nsec; waiting
65 nsec is proposed. The net effect is that this type of jump takes an
additional 65 nsec. On the other hand, if the instruction located at address
P + 1 is performed, no extra time is needed.
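
The timing rule of this section, with its two exceptions, can be collected into a small calculation. The function and constant names below are hypothetical, and the sketch assumes that the 65-nsec floor is applied after rounding to the nearest 5 nsec.

```python
READ_REGS = 30.0   # nsec, general register read (charged to every instruction)
D_BUS = 5.0        # nsec, result returned to the general registers
SAFETY = 1.15      # 15 percent safety factor
MIN_CYCLE = 65     # nsec, floor imposed by the 60-nsec program memory read
STEP = 5           # nsec, instruction times are rounded to this increment


def instruction_time(box_nsec, writes_result=True, skips_next=False):
    """Estimated instruction time in nsec.

    box_nsec      -- time spent in the function box (X in the text)
    writes_result -- whether a result travels back over the D bus
    skips_next    -- jump that skips the instruction at P + 1
    """
    raw = READ_REGS + box_nsec + (D_BUS if writes_result else 0.0)
    rounded = STEP * round(raw * SAFETY / STEP)
    cycle = max(rounded, MIN_CYCLE)
    # A jump that discards the prefetched instruction waits one more cycle.
    return cycle + (MIN_CYCLE if skips_next else 0)
```

For example, `instruction_time(21.8)` gives the 12-bit add time and `instruction_time(29.0)` the 24-bit add time; a 15-nsec jump is held at the 65-nsec floor unless it skips the following instruction.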

E.  4-Quadrant  Array  Multiplier 

The ASP will have two 4-quadrant array multipliers that may be operated
separately or in parallel at the behest of the programmer. Each multiplier
function box comprises a network of interconnected 4-bit ALUs,
specifically, the Motorola MC 10181 ALU package. A given multiplier will
accept two signed, 12-bit, 2's complement operands, one from the A bus
(upper or lower byte) and one from the B bus (upper or lower byte). The
output (product) consists of 24 bits, the two most significant of which are
considered sign bits. These are always equal except when squaring the
largest negative number. All 24 bits of product may be placed on the D bus,
if desired. If both operands are considered integers, only bits 1-12 of the
product are retrieved and placed on either the upper or lower D-bus byte,
depending on the multiplier in question. If the operands are considered to
be binary fractions (binary point to the right of the sign bit), then the product
is considered to be a fraction with the binary point to the right of the least
significant of the two sign bits. Thus bits 12 through 23 of the output are
retrieved and placed on either D-bus byte, so that the product can be
considered a binary fraction represented in exactly the same fashion as the
operands.

The overflow flags associated with the upper and lower D-bus bytes
can be set by the multipliers depending on the destination of the opted
product bits. The rules governing overflow are:

(1)  An  integer  multiply  will  set  the  appropriate  overflow  flag  if 
bits  12  through  24  of  the  product  are  not  identical.  This  implies 
that  the  product  is  not  representable  in  12  bits. 

(2)  A  fraction  multiply  will  set  the  appropriate  overflow  flag  if 
bits  23  and  24  of  the  product  (the  two  nominal  sign  bits)  are  not 
identical. 

(3)  A  multiply  involving  a  24-bit  product  transfer  cannot  set  the 
overflow  flag  on  the  lower  D-bus  byte,  but  will  set  the  upper  byte 
flag  if  bits  23  and  24  of  the  product  are  not  identical. 

Overflow  conditions  may  be  tested  via  the  overflow  jump  (JOV) 
instruction. 
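
The bit selection and overflow rules (1) and (2) can be illustrated as follows. This is a sketch under assumptions: product bits are numbered from 1 at the least significant end, as the bit ranges in the text suggest, and the function names are hypothetical. The 24-bit transfer of rule (3) is not modelled.

```python
def to_signed(value, bits):
    """Interpret the low `bits` of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def bit(word, k):
    """Bit k of word, numbered from 1 at the least significant end."""
    return (word >> (k - 1)) & 1


def multiply(a12, b12, mode):
    """Return (selected 12-bit field, overflow flag) for one multiplier."""
    product = (to_signed(a12, 12) * to_signed(b12, 12)) & 0xFFFFFF
    if mode == "integer":
        field = product & 0xFFF                       # bits 1-12
        upper = {bit(product, k) for k in range(12, 25)}
        overflow = len(upper) > 1                     # bits 12-24 differ
    elif mode == "fraction":
        field = (product >> 11) & 0xFFF               # bits 12-23
        overflow = bit(product, 23) != bit(product, 24)
    else:
        raise ValueError(mode)
    return field, overflow
```

With fractional operands, `multiply(0x400, 0x400, "fraction")` (one half times one half) returns the field 0x200 (one quarter), while squaring the largest negative number, `multiply(0x800, 0x800, "fraction")`, sets the overflow flag.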

The operation of the multipliers is most easily visualized by
understanding 2's complement number representation, where a number is defined:

$$X = -X_s \cdot 2^{N-1} + \sum_{i=0}^{N-2} X_i \cdot 2^i \qquad (1)$$

Here, an N-bit binary word is considered to be the sum of two polynomials,
one negative and one positive. $X_s$, the binary coefficient of $2^{N-1}$, is the
sign bit. The binary coefficients $X_i$ are the rest of the bits of the word.

A signed N-bit by N-bit arithmetic product may be written as follows in
terms of this definition:*

$$Z = X \cdot Y = \left[-X_s \cdot 2^{N-1} + \sum_{i=0}^{N-2} X_i \cdot 2^i\right] \cdot \left[-Y_s \cdot 2^{N-1} + \sum_{j=0}^{N-2} Y_j \cdot 2^j\right]$$

$$Z = X_s \cdot Y_s \cdot 2^{2N-2} + 2^{N-1}\left[Y_s \cdot \left(-\sum_{i=0}^{N-2} X_i \cdot 2^i\right) + X_s \cdot \left(-\sum_{j=0}^{N-2} Y_j \cdot 2^j\right)\right] + \sum_{j=0}^{N-2}\sum_{i=0}^{N-2} X_i \cdot Y_j \cdot 2^{i+j} \qquad (2)$$

* · is multiplication, + is addition, - is subtraction.

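This expression can be further rewritten by observing a simple
implication of the 2's complement definition: the sign of a given number may be
changed by complementing all coefficients and then adding 1. Mathematically
speaking,

$$-\sum_{i=0}^{N-2} X_i \cdot 2^i = \left(\sum_{i=0}^{N-2} \bar{X}_i \cdot 2^i\right) + 1 \qquad (3)$$

$$-\sum_{j=0}^{N-2} Y_j \cdot 2^j = \left(\sum_{j=0}^{N-2} \bar{Y}_j \cdot 2^j\right) + 1 \qquad (4)$$

where the bar denotes the complemented bit and each complemented sum is
read as a negative 2's complement entity, that is, with its sign bits equal
to 1.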

Using Eqs. (3) and (4) to rewrite Eq. (2):

$$Z = X_s \cdot Y_s \cdot 2^{2N-2} + 2^{N-1}\left[Y_s \cdot \sum_{i=0}^{N-2} \bar{X}_i \cdot 2^i + X_s \cdot \sum_{j=0}^{N-2} \bar{Y}_j \cdot 2^j\right] + (X_s + Y_s) \cdot 2^{N-1} + \sum_{i=0}^{N-2}\sum_{j=0}^{N-2} X_i \cdot Y_j \cdot 2^{i+j} \qquad (5)$$

Four-quadrant (all sign options) multiplication thus seems to involve a series
of coefficient additions with proper weighting conventions observed. Given
that $X_i$ and $Y_j$ are binary digits, their arithmetic product is simply a logical
product (AND):*

$$X_i \cdot Y_j = X_i \cap Y_j$$

* $A \cap B$ is logical "AND," $A \cup B$ is logical inclusive "OR."

Thus Eq. (5) can be rewritten once more as

$$Z = (X_s \cap Y_s) \cdot 2^{2N-2} + 2^{N-1}\left[Y_s \cap \sum_{i=0}^{N-2} \bar{X}_i \cdot 2^i + X_s \cap \sum_{j=0}^{N-2} \bar{Y}_j \cdot 2^j\right] + (X_s + Y_s) \cdot 2^{N-1} + \sum_{j=0}^{N-2}\sum_{i=0}^{N-2} (X_i \cap Y_j) \cdot 2^{i+j} \qquad (6)$$


Notice that the last term of Eq. (6) can be expanded in the form

$$\sum_{j=0}^{N-2}\sum_{i=0}^{N-2} (X_i \cap Y_j) \cdot 2^{i+j} = Y_0 \cap \sum_{i=0}^{N-2} X_i \cdot 2^i + 2\,Y_1 \cap \sum_{i=0}^{N-2} X_i \cdot 2^i + \cdots + 2^{N-2}\,Y_{N-2} \cap \sum_{i=0}^{N-2} X_i \cdot 2^i \qquad (7)$$


If the multiplicand is assumed to be the binary word

$$X = X_s\,X_{N-2}\,X_{N-3} \cdots X_1\,X_0 \quad (N \text{ bits})$$

and the multiplier is assumed to be the binary word

$$Y = Y_s\,Y_{N-2}\,Y_{N-3} \cdots Y_1\,Y_0 \quad (N \text{ bits})$$

then Eqs. (6) and (7) lead to Fig. 14. Here is illustrated the array of
weighted multiplicand coefficients to be conditionally summed, depending
on the multiplier coefficients. Notice how Eq. (7) is implemented in the
upper 11 rows of the array. The first three terms of Eq. (6) are
incorporated as the bottom rows of the array.
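
As a numerical check on Eqs. (6) and (7), the rows of the array can be summed directly and compared with an ordinary signed product. The row layout of Fig. 14 is not reproduced here, so the arrangement below is an assumption drawn from the equations: the two complemented rows are treated as negative entities whose sign bits of 1 are extended to the full 2N-bit product width. The function names are hypothetical.

```python
def signed(value, bits):
    """Interpret the low `bits` of value as a 2's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def array_multiply(x, y, n):
    """Sum the rows of Eqs. (6) and (7) for two n-bit 2's complement words."""
    mag_mask = (1 << (n - 1)) - 1        # the n-1 bits below the sign bit
    word_mask = (1 << (2 * n)) - 1       # the product is kept to 2n bits
    sign_ext = word_mask & ~((1 << (2 * n - 2)) - 1)   # top two bit positions
    xs, ys = (x >> (n - 1)) & 1, (y >> (n - 1)) & 1
    xm, ym = x & mag_mask, y & mag_mask

    def negative_row(bits):
        # Complemented bits at weight 2^(n-1), with sign bits of 1 extended.
        return ((bits ^ mag_mask) << (n - 1)) | sign_ext

    total = (xs & ys) << (2 * n - 2)             # first term of Eq. (6)
    total += negative_row(xm) if ys else 0       # Ys AND complemented X row
    total += negative_row(ym) if xs else 0       # Xs AND complemented Y row
    total += (xs + ys) << (n - 1)                # the "+1" of each negation
    for j in range(n - 1):                       # Eq. (7): the upper rows
        if (y >> j) & 1:
            total += xm << j
    total &= word_mask
    return signed(total, 2 * n)
```

Running this over every pair of 6-bit operands reproduces the signed product in each case, which supports the reading of the complemented rows as sign-extended negative entities.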

There are a number of ways to effect the actual summing of the
entities in this array, some optimized for speed, others to conserve
hardware. One obvious way is to explicitly form all the logical products
$X_i \cap Y_j$ (called partial products) and do a straightforward addition of the
resulting partial product array as it stands. Carry and sum paths can
be arranged to optimize speed performance with regard to the relative
carry and sum delays inherent in the adder elements used.

Fig. 14. Basic array of coefficients to be summed for 4-quadrant multiply.


Fig.  15.  Grouping  of  coefficient  array  for  parallel 


Another method of summing the partial product array is to group the
rows in pairs and add them separately, but in parallel. The results of the
first stage of adds are also grouped into pairs and in turn added. The
process continues until all partial sums have been combined to yield the
desired product. In general, if there are N multiplier bits (including sign),
the number of adder stages necessary is given by

$$S = \log_2 (N + 1), \text{ rounded to next highest integer,}$$

which includes the extra rows due to sign correction. For N = 12, as in the
ASP, the number of stages necessary is 4. Figure 15 shows the coefficient
array for the ASP case grouped for parallel summing. This method is
sometimes called the "binary tree" algorithm.
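
The stage count can be checked both from the formula and by simulating the pairing. The sketch assumes N + 1 rows enter the tree, as the formula implies; the function names are hypothetical.

```python
import math


def adder_stages(n_multiplier_bits):
    """Stages given by S = log2(N + 1), rounded up to an integer."""
    return math.ceil(math.log2(n_multiplier_bits + 1))


def stages_by_pairing(rows):
    """Count stages by repeatedly adding the rows two at a time."""
    stages = 0
    while rows > 1:
        rows = (rows + 1) // 2   # an unpaired row passes to the next stage
        stages += 1
    return stages
```

For N = 12 both functions give 4 stages (13 rows reduce to 7, 4, 2, then 1), and for N = 6 both give 3.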

Irrespective of the actual summing mechanism used for the partial
product array, it should be noticed that some simplification of the rows
involving $\bar{X}_i$ and $\bar{Y}_j$ can be effected. Theoretically these rows represent
negative entities and thus must be assigned sign bits equal to 1. In order to
incorporate them correctly into the summing operation, the sign bits must be
extended as far as is necessary to derive the requisite number of product
bits. The sign extension is clear in Figs. 14 and 15. When the partial
product array is formed, the "south west" corner of the array appears as
in Fig. 16a. Some Boolean algebraic manipulations show that the rightmost
column can be reduced to produce the situation shown in Fig. 16b. The
center column can be similarly reduced, giving rise to Fig. 16c. Clearly,
the process could be extended indefinitely. The net result is that the sign
bits may be dropped off the $\bar{X}_i$ and $\bar{Y}_j$ rows if the columns containing
$X_s \cap Y_s$ and beyond are replaced by $X_s \cup Y_s$. Clearly, this process need
only continue as far as necessary to produce the last product bit for the
ASP case.

Fig. 16. Simplification of high order end of coefficient array.

The proof is as follows:

Using an adder unit, 3 bits of equal weight can be reduced to a net sum
and a carry:

$$\text{SUM} = A \oplus B \oplus C \quad (\oplus = \text{exclusive "OR"})$$

$$\text{CARRY} = (A \cap B) \cup (A \oplus B) \cap C$$

For the case at hand, let

$$X_s = A, \qquad Y_s = B, \qquad X_s \cap Y_s = C;$$

then

$$\text{SUM} = X_s \oplus Y_s \oplus (X_s \cap Y_s) = X_s \cup Y_s$$

$$\text{CARRY} = (X_s \cap Y_s) \cup (X_s \oplus Y_s) \cap (X_s \cap Y_s) = X_s \cap Y_s$$

Therefore the sum of $X_s$, $Y_s$, and $X_s \cap Y_s$ reduces to a sum equal to $X_s \cup Y_s$
and a carry into the next column equal to $X_s \cap Y_s$. The next column is now
identical to the first and the process is repeated. Clearly, this can continue
ad infinitum.
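
The two Boolean identities used in the proof are easy to confirm by exhausting the four sign combinations. The helper name `full_adder` is hypothetical; it implements the SUM and CARRY expressions given above.

```python
from itertools import product


def full_adder(a, b, c):
    """Reduce three bits of equal weight to (sum, carry)."""
    total = a ^ b ^ c
    carry = (a & b) | ((a ^ b) & c)
    return total, carry


# Check every combination of the two sign bits.
for xs, ys in product((0, 1), repeat=2):
    s, c = full_adder(xs, ys, xs & ys)
    assert s == (xs | ys)    # the column sum is Xs OR Ys
    assert c == (xs & ys)    # the carry reproduces Xs AND Ys
```

Because the carry equals the bit that was fed in, the next column sees the same three inputs, which is why the reduction repeats unchanged.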

The actual algorithm implemented for the ASP multipliers is basically
of the tree type and requires four adder stages. However, it is not
necessary to explicitly form the partial product array, due to the nature of the
adder element used. The MC 10181 is a programmable ALU in that it can be
made to perform myriad operations on the input operands in response to
commands from control, or programming, inputs. In the multiplier, the
first stage of units is controlled by pairs of multiplier bits; the other stages
are hard wired as adders. The inputs to the first stage are the multiplicand
bits, arranged for appropriate weighting. The 10181 package can be caused
to add its 2 operand inputs, or gate either (or neither) through singly.
These operations are all that are necessary to, in effect, form and combine
the partial products.
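
The effect of controlling a first-stage unit with a pair of multiplier bits can be sketched for unsigned magnitudes. Several things here are assumptions: the A input is taken to carry X and the B input 2X, as in the report's 6 by 6 example; the mapping of bit pairs to operations is the natural one but is not copied from Fig. 18; the sign-correction rows are omitted; and the later stages are summed sequentially rather than as a tree. The function names are hypothetical.

```python
def first_stage(x, y_pair):
    """One first-stage unit: A = X, B = 2X, selected by two multiplier bits."""
    a, b = x, x << 1
    low, high = y_pair & 1, (y_pair >> 1) & 1
    if low and high:
        return a + b     # add both operand inputs
    if low:
        return a         # gate A through
    if high:
        return b         # gate B through
    return 0             # gate neither


def multiply_unsigned(x, y, bits):
    """Combine the first-stage outputs with their relative weights."""
    total = 0
    for k in range(0, bits, 2):
        total += first_stage(x, (y >> k) & 0b11) << k
    return total
```

Each unit thus delivers the sum of two partial-product rows without those rows ever being formed explicitly, which halves the number of rows passed to the hard-wired adder stages.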

Figure  17  illustrates  the  grouping  of  a  coefficient  array  for  a  signed, 

6  by  6  multiply,  as  an  example.  Three  stages  of  adders  are  necessary. 
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Fig.  17.  Evolution  of  interconnections  for  6  by  6  multiplier  adder  array. 


The grouping and combining of partial sums is depicted as the process
evolves from right to left. The attendant hardware realization is shown in
Fig. 18. The various stages and intermediate variables are labelled as in
Fig. 17. Notice the manner in which the multiplicand (X) is distributed to
the first stage. The A input is equal to X; the B input is equal to X
shifted left one place, or 2X. Thus the relative weighting of subsequent
rows in the array is preserved. The appropriate relative weighting of all
partial sums is observed when combining them in the subsequent stages.
Notice also that the control rules for each first stage unit are included
in Fig. 18.

Extension to the full 12 by 12 bit case is somewhat complex, but
conceptually straightforward. Figure 19 shows the basic arrangement of 10181
units for this case.
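The first-stage control and the adder tree can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: operands are treated as unsigned (the ASP array handles signed operands, which is not modelled here), and the names `first_stage_unit` and `tree_multiply` are hypothetical.

```python
def first_stage_unit(x, pair):
    """One first-stage unit: A input = X, B input = 2X, steered by two multiplier bits."""
    a = x if pair & 0b01 else 0          # low bit of the pair gates X through
    b = (x << 1) if pair & 0b10 else 0   # high bit of the pair gates 2X through
    return a + b                         # both bits set: the unit adds A and B


def tree_multiply(x, y, bits=12):
    """Unsigned product of two `bits`-wide operands, plus the number of stages used."""
    # First stage: one unit per pair of multiplier bits, weighted by pair position.
    sums = [first_stage_unit(x, (y >> k) & 0b11) << k for k in range(0, bits, 2)]
    stages = 1
    # Remaining stages: plain adders that combine partial sums pairwise.
    while len(sums) > 1:
        paired = [sums[i] + sums[i + 1] for i in range(0, len(sums) - 1, 2)]
        if len(sums) % 2:
            paired.append(sums[-1])      # an odd partial sum passes to the next stage
        sums = paired
        stages += 1
    return sums[0], stages
```

Under these assumptions the 6 by 6 case uses three stages and the 12 by 12 case uses four, in agreement with the stage counts quoted in the text.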

A  minor  modification  of  the  array  illustrated  in  Fig.  19  can  be  shown 
to  require  only  38  of  the  10181  packs,  and  about  25  assorted  16-pin  support 
logic  packs.  The  basic  multiplier,  exclusive  of  control  setup  and  operand 
shuffling  overhead,  is  expected  to  operate  in  42  nsec. 

F.  4-Quadrant  Array  Divider 

The ASP divider function box is comprised of a combinational array of
adder and subtractor logic elements. The network accepts a 24-bit word
from the A bus as a dividend (or numerator). The word is interpreted as a
signed, 2's complement entity with one sign bit and 23 information bits. The
divisor (or denominator) is a 12-bit word that may come from either the
upper or lower byte of the B bus. It is interpreted as a signed, 2's
complement number consisting of one sign bit and 11 information bits. The
array produces a 12-bit quotient and a 12-bit remainder, both consisting of
one sign and 11 data bits. The quotient is entered on the upper byte of the
D bus, the remainder on the lower byte. The divisor and dividend may be
considered to be integer, fractional, or mixed numbers. In most instances,
however, it seems reasonable that the entities will be considered to be
fractions with the binary point situated to the right of the sign bit. The
divider overflow logic is designed to be most consistent with this
interpretation.

The  underlying  operating  principle  of  the  array  is  that  of  nonrestoring 
binary  division.  The  procedure  is  most  easily  understood  by  considering  the 
divisor  and  dividend  to  be  positive  fractional  quantities  wherein  the  divisor 
is  larger  than  or  equal  to  the  dividend.  The  quotient  will  be  a  positive  frac¬ 
tion  in  this  instance.  To  obtain  the  first  quotient  data  bit,  a  trial  divisor 
equal  to  half  the  actual  divisor  is  subtracted  from  the  dividend  yielding  a 
partial  dividend.  If  the  partial  dividend  is  positive,  then  a  1  is  entered  as 
the  quotient  bit.  If  negative,  however,  the  trial  divisor  did  not  "go  into"  the 
dividend  and  a  0  must  be  entered  as  the  quotient  bit.  In  normal  (restoring) 
division it would be necessary at this point to add the trial divisor to the
partial dividend. This would restore the original dividend before an attempt
is  made  to  subtract  a  new  trial  divisor.  Nonrestoring  division  makes  use  of 
the  fact  that  each  successive  trial  divisor  is  half  the  preceding  one.  Thus 
the  addition  (or  restoration)  of  a  trial  divisor  followed  by  subtraction  of  one 
half  that  very  same  trial  divisor  is  nothing  more  than  a  net  addition  of  half 
the  trial  divisor.  Returning  to  the  example,  if  the  first  data  bit  of  the  quo¬ 
tient  is  a  1,  the  previous  trial  divisor  is  halved  and  subtracted  from  the  par¬ 
tial  dividend  to  produce  the  next  quotient  bit.  If  the  first  quotient  bit  is  a  0, 
the  trial  divisor  is  halved  and  added  to  the  partial  dividend  to  produce  the 
next  quotient  bit.  This  procedure  continues  until  all  desired  quotient  data 
bits  have  been  produced.  The  algorithm  has  the  distinct  advantage  of  being 
realizable  as  an  unclocked  array.  There  are  no  feedback  loops;  the  process 
flows  unconditionally  from  beginning  to  end  without  any  "back-up"  steps. 
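The positive-operand procedure just described can be sketched as follows. This is a minimal illustration, not the ASP array itself: operands are plain Python integers, the dividend is taken to be a 2n-bit value and the divisor an n-bit value with dividend < divisor * 2**n (the "quotient is a fraction" condition), and the name `nonrestoring_divide` is hypothetical.

```python
def nonrestoring_divide(dividend, divisor, n):
    """Return (quotient, last partial dividend) for positive operands.

    Assumes 0 <= dividend < divisor * 2**n so that the quotient fits in n bits.
    """
    # First trial: subtract half the (fraction-aligned) divisor.
    partial = dividend - (divisor << (n - 1))
    quotient = 0
    for i in range(1, n + 1):
        bit = 1 if partial >= 0 else 0        # positive partial dividend -> quotient bit 1
        quotient = (quotient << 1) | bit
        if i < n:
            trial = divisor << (n - 1 - i)    # each trial divisor is half the previous one
            # Bit 1: subtract the halved trial divisor. Bit 0: add it (no restore step).
            partial = partial - trial if bit else partial + trial
    return quotient, partial
```

For example, `nonrestoring_divide(7, 2, 2)` yields a quotient of 3 with a final partial dividend of 1. In this sketch the final partial dividend can be negative when the last quotient bit is 0; adding the divisor once then gives the conventional nonnegative remainder.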

Realizing this division procedure in practice, for the 4-quadrant case
(all sign combinations of divisor and dividend possible), requires some
manipulation. The heart of the divider array consists of a series of
adder/subtractor stages. Each of the stages will either add or subtract the
appropriate divisor from the appropriate partial dividend depending on the
sign bit of the partial dividend in question, and the sign bit of the
divisor proper. The array will accept any combination of dividend and
divisor signs. However, the set of quotient data bits produced by the array
must be corrected at the end for certain sign combinations.

The  topmost  portion  of  a  diagram  for  the  procedure  (Fig.  20)  depicts 
generation  of  the  quotient  data  bits.  The  bottom  section  depicts  the  end  cor¬ 
rection.  The  rules  for  generating  a  quotient  bit  at  any  given  stage  of  the 
array  are: 

(1)  If  the  present  partial  dividend  is  positive,  enter  a  1  as  the 
concomitant  quotient  bit.  If  not,  enter  a  zero. 

(2)  If  the  divisor  and  the  present  partial  dividend  have  the  same 
sign,  subtract  the  next  trial  divisor. 

(3)  If  the  divisor  and  the  present  partial  dividend  have  differing 
signs,  add  the  next  trial  divisor. 


Fig. 20. Conceptual flow diagram of nonrestoring divide.

The above conventions imply two interesting facts: First, a positive
dividend with either a positive or a negative divisor will yield the proper
quotient magnitude. Second, a negative dividend with either a positive or
a negative divisor will yield the complement of the magnitude of the
quotient. A correction based on the actual signs of the operands must be
applied to derive a correct 2's complement quotient representation. For
example, suppose the divisor is positive and the numerator is negative. The
quotient bits generated will turn out to be a 1's complement representation
of the correct negative result. Thus the result must be incremented by 1 to
yield the proper 2's complement representation. As a further example,
suppose both the divisor and dividend are negative. Clearly the quotient
ought to be positive. The array produces, however, the complement of the
correct result, and a pure inversion of the quotient bits is necessary. All
of these cases are dealt with via an extra adder/subtractor stage at the
very bottom of the array which performs the actual correction. Only one
case needs no correction: positive divisor and positive dividend. The
correction rules are:

(1) If both the divisor and dividend are positive, assign the
quotient sign bit the value 0 and do nothing to the quotient data
bits.

(2) If both the divisor and dividend are negative, assign the
quotient sign bit the value 0 and complement the quotient data
bits.

(3) If the divisor is positive and the dividend negative, assign
the sign bit the value 1 and increment the quotient data bits by 1.

(4) If the divisor is negative and the dividend is positive, assign
the sign bit the value 1. Complement the quotient data bits and
increment by 1.
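The four correction rules can be written out as a small Python sketch. It models only the end correction, taking the raw quotient data bits as given; the 11-bit data width follows the text, while the function names are hypothetical and the zero-quotient edge case is not addressed.

```python
DATA_BITS = 11
DATA_MASK = (1 << DATA_BITS) - 1


def end_correct(raw_bits, dividend_negative, divisor_negative):
    """Apply correction rules (1)-(4) and return the 12-bit 2's complement quotient word."""
    if not dividend_negative and not divisor_negative:
        sign, data = 0, raw_bits                 # rule (1): no correction
    elif dividend_negative and divisor_negative:
        sign, data = 0, ~raw_bits                # rule (2): complement
    elif dividend_negative:
        sign, data = 1, raw_bits + 1             # rule (3): increment by 1
    else:
        sign, data = 1, ~raw_bits + 1            # rule (4): complement and increment
    return (sign << DATA_BITS) | (data & DATA_MASK)


def to_signed(word):
    """Interpret a 12-bit word as a signed 2's complement integer."""
    return word - (1 << 12) if word & (1 << DATA_BITS) else word
```

As a usage check, a quotient magnitude of 3 arrives from the array either as 3 (positive dividend) or as its 11-bit complement (negative dividend), per the two facts stated above; after correction the four sign combinations decode to +3, +3, -3, and -3 respectively.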

Conceptually, all array additions and subtractions involving an N bit
signed divisor can be carried out on an N + 1 bit basis. The difference (or
sum) between any given partial dividend and its trial divisor should also be
representable in no more than N bits (really N - 1 bits plus sign). Thus,
for this case, bits 12 and 13 of any partial dividend, being both presumably
sign bits, ought always to be in agreement. If not, an overflow indication
is rendered. This indication implies, in terms of fractional operands, that
the dividend was greater in magnitude than the divisor. The condition is
illegal because the quotient would have to be greater than 1 and, hence,
could not be represented as a signed fraction. Since the quotient appears
on the upper byte of the D bus, the overflow flag for that byte is set.

The overflow condition can also be interpreted in the context of integer
operands. In such an instance an overflow will occur if the magnitude of the
dividend is greater than $2^{12}$ times the magnitude of the divisor. In
such an instance the quotient would be on the order of $2^{12}$, which
cannot be represented in 11 data bits plus sign.

If  operating  with  mixed  operands,  it  would  be  incumbent  upon  the  pro¬ 
grammer  to  ascertain  the  implied  overflow  conditions  appropriate  to  his  own 
representation  conventions. 

Note that to generate the N-1 bit quotient and a sign from an N-1 bit
divisor plus sign, only 2(N-1) bits of dividend plus a sign are necessary.
This implies that the least significant bit of the 24-bit dividend operand
(A-bus input) never enters the calculation of the quotient.

The partial dividend that determined the last quotient bit (i.e., the
result of the last add/subtract stage) is considered to be the remainder.
It is a 12-bit entity (11 bits plus sign) and is related to the other
operands by the equation:

DIVIDEND = (QUOTIENT) X (DIVISOR) + REMAINDER.

It can be used in conjunction with more dividend bits to derive an
extended precision quotient. The actual formation of the extended dividend
is involved, but it can be done and the signed remainder is necessary.

Figure 21 shows an actual hardware realization of a divider that
accepts a 4-bit divisor and an 8-bit dividend yielding a 4-bit quotient and
a 4-bit remainder. The realization can be extended in a straightforward
manner to the 24-bit/12-bit case. The adder/subtractors represented can be
realized with the MECL 10K, 4-bit, ALU package (MC 10181). The unit can be
programmed to either add or subtract in response to a control. The actual
subtraction is accomplished by changing the sign of the B input and adding.
This operation, in effect, requires that the B input be inverted and
incremented by 1. The package does the complementation internally, but the
1 must be supplied at the carry "in" input wherever a subtraction is to
occur.

Fig. 21. 4 by 8 nonrestoring divider array with overflow detection and
end sign correction.

It might be expected that since the divisor is a 4-bit entity, all
add/subtracts ought to be done on a 5-bit basis as was implied earlier. It
can be shown, via some manipulation, that the (N+1)st bit can be simply
realized as nothing more than the carry out of the Nth bit with a slight
change in rules. The simplified rules now can be stated succinctly:

(1) 1st Stage - If the sign bits of the divisor and dividend differ,
then add. If not, subtract.

(2) All Subsequent Stages - If the carry out of the Nth bit
of the previous adder/subtractor is a 1, enter 1 as the quotient
digit. Also, if the carry out is different from the divisor sign
bit, set the present stage to subtract. Add otherwise.

(3) Overflow - If the carry out of the Nth bit for any given
stage is the same as the Nth bit out of that stage, signal an
overflow.

(4) End Correction - can best be summarized in tabular form:

Dividend sign   Divisor sign   Quotient sign   Quotient
(SN)            (SD)           (SQ)            correction

0               0              0               none
0               1              1               complement, then increment by 1
1               0              1               increment by 1
1               1              0               complement

Logic equations are easily derived:

$$S_Q = C_i = S_N \oplus S_D$$
$$\mathrm{SUB} = S_D$$
$$\mathrm{ADD} = \overline{S_D}$$

These are seen as the controls implemented in the figure for the end
correction stage. Notice that the "A" input to this stage is necessarily a
hard-wired zero.

The actual 24-bit/12-bit divider is realized using twelve 12-bit stages.
The first 11 derive the quotient bits; the last does the correction. Each
stage requires three MC 10181 packages with a look-ahead carry generator
arranged to feed bit 9. Thus 36 MC 10181 units are required. Each stage
is capable of producing all necessary partial dividend bits in 13 nsec.
Therefore the entire operation will require 12 x 13 = 156 nsec. The
number of packages required to synthesize controls, overflow functions,
and perform data distribution is incidental. Thus, the entire unit is
smaller in terms of packages than the multiplier function box. The actual
net divide instruction execution time will be greater than 156 nsec due to
overhead associated with control decoding, operand fetch, and deposition of
the quotient.

G.  Square  Root  Function  Box 

The ASP square root function box, an optional extra feature, is
comprised of a combinatorial array of adder/subtractor logic in much the
same manner as the divider function box. The input to the array consists of
a signed, 24-bit, 2's complement number from the A bus. It is interpreted
as a positive fraction, the binary point situated to the right of the sign
bit. The output is a 12-bit, positive, 2's complement fraction that is
placed on the upper byte of the D bus. If the input should happen to be
negative, a fault condition is signalled by setting the overflow flag
associated with the upper byte of D. No remainder is provided since
normally more than 12 bits are necessary to properly represent it.

In analogous fashion to the divider, the square root algorithm used is
one that lends itself to realization as an unclocked, combinatorial array of
logic. The procedure is termed the nonrestoring square root algorithm.³
The array is built up as a series of adder/subtractor stages. No back-up
steps are required; the process moves irrevocably forward from start to
finish.

To see how the procedure evolves, assume a 10-bit radicand, R, of
the form:

$$R = 0.R_1 R_2 \cdots R_{10}$$

The root is to be expressed as a positive fraction of the form

$$F = 0.f_1 f_2 f_3 f_4$$

In straightforward fashion, a series of tests can be tabulated, which should
be performed on R:

(1) Is $R \ge (.1)^2$? If yes, $f_1 = 1$; otherwise $f_1 = 0$.

(2) Is $R \ge (.f_1 1)^2$? If yes, $f_2 = 1$; otherwise $f_2 = 0$.

(3) Is $R \ge (.f_1 f_2 1)^2$? If yes, $f_3 = 1$; otherwise $f_3 = 0$.

(4) Is $R \ge (.f_1 f_2 f_3 1)^2$? If yes, $f_4 = 1$; otherwise $f_4 = 0$.

Implicit in the foregoing is the undesirable process of squaring trial
radicands. By some manipulation these tests can be arranged in a more
manageable form. Notice the following:

$$(.1)^2 = .01$$

$$(.f_1 1)^2 = (.f_1 + .01)^2 = (.f_1)^2 + .0001 + .0f_1 = (.f_1)^2 + .0f_1 01$$

$$(.f_1 f_2 1)^2 = (.f_1 f_2 + .001)^2 = (.f_1 f_2)^2 + .000001 + .00f_1 f_2 = (.f_1 f_2)^2 + .00f_1 f_2 01$$

similarly

$$(.f_1 f_2 f_3 1)^2 = (.f_1 f_2 f_3)^2 + .000f_1 f_2 f_3 01$$

etc.

Clearly, the following tests may be substituted for the originals:

(1) Is $R \ge (.1)^2$? If yes, $f_1 = 1$; otherwise $f_1 = 0$.

(2) Is $R - (.f_1)^2 \ge .0f_1 01$? If yes, $f_2 = 1$; otherwise $f_2 = 0$.

(3) Is $R - (.f_1 f_2)^2 \ge .00f_1 f_2 01$? If yes, $f_3 = 1$; otherwise $f_3 = 0$.

(4) Is $R - (.f_1 f_2 f_3)^2 \ge .000f_1 f_2 f_3 01$? If yes, $f_4 = 1$; otherwise $f_4 = 0$.

It would appear that some squaring is still necessary. However, the
squared terms can be easily formed. For simplicity, define the partial
radicands and associated test values as follows:

$$R_0 = R, \qquad \alpha_0 = .01$$

$$R_1 = R - (.f_1)^2, \qquad \alpha_1 = .0f_1 01$$

$$R_2 = R - (.f_1 f_2)^2, \qquad \alpha_2 = .00f_1 f_2 01$$

$$R_3 = R - (.f_1 f_2 f_3)^2, \qquad \alpha_3 = .000f_1 f_2 f_3 01$$

Now the following set of observations can be made:

$$R_1 = R - (.f_1)^2 = \begin{cases} R, & \text{if } f_1 = 0 \\ R - .01, & \text{if } f_1 = 1 \end{cases}$$

$$R_2 = R - (.f_1 f_2)^2 = \begin{cases} R - (.f_1)^2 = R_1, & \text{if } f_2 = 0 \\ R - (.f_1 1)^2 = R_1 - .0f_1 01 = R_1 - \alpha_1, & \text{if } f_2 = 1 \end{cases}$$

$$R_3 = R - (.f_1 f_2 f_3)^2 = \begin{cases} R - (.f_1 f_2)^2 = R_2, & \text{if } f_3 = 0 \\ R - (.f_1 f_2 1)^2 = R_2 - .00f_1 f_2 01 = R_2 - \alpha_2, & \text{if } f_3 = 1 \end{cases}$$

The pattern seems well established and the procedure for obtaining the ith
root bit can be stated succinctly: subtract $\alpha_{i-1}$ from $R_{i-1}$.
If the result is positive, enter $f_i = 1$. If not, enter $f_i = 0$ and add
back (restore) $\alpha_{i-1}$. Clearly $R_i = R_{i-1} - \alpha_{i-1}$ if
$f_i = 1$, or $R_i = R_{i-1}$ if $f_i = 0$. Thus the process can continue
until all desired f bits have been extracted.
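The restoring procedure can be sketched with integer arithmetic. This is an illustrative sketch, not the ASP hardware: the radicand is held as a 2n-bit integer r standing for the fraction r / 2**(2n), the root as an n-bit integer, and the name `restoring_sqrt` is hypothetical. The test value for bit i, a string of zeros followed by the root bits so far and then 01, becomes `(4 * root + 1)` shifted to the proper weight.

```python
def restoring_sqrt(r, n):
    """Return (root, partial radicand) for a 2n-bit radicand r, one root bit per step."""
    root = 0
    partial = r                                   # R_0 = R
    for i in range(1, n + 1):
        # Test value: leading zeros, the root bits found so far, then 01.
        alpha = (4 * root + 1) << (2 * (n - i))
        trial = partial - alpha
        if trial >= 0:
            partial = trial                       # keep the subtraction, bit = 1
            root = (root << 1) | 1
        else:
            root = root << 1                      # bit = 0, partial radicand restored
    return root, partial
```

For instance, with n = 4 the radicand .10010000 (r = 144) gives the root .1100 (root = 12) and a zero partial radicand, since .75 squared is .5625.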
Step      If $R_i \ge 0$,               If $R_i < 0$,                 Root
Number    $R_{i+1} = R_i - (\ )$        $R_{i+1} = R_i + (\ )$        Status

1         $.01$                         X                             $0.$
2         $.0101$                       $.0011$                       $0.f_1$
3         $.00f_1 101$                  $.00f_1 011$                  $0.f_1 f_2$
4         $.000f_1 f_2 101$             $.000f_1 f_2 011$             $0.f_1 f_2 f_3$
5         $.0000f_1 f_2 f_3 101$        $.0000f_1 f_2 f_3 011$        $0.f_1 f_2 f_3 f_4$
...
The process has been reduced to one of subtracts or adds of the
appropriate test values ($\alpha_i$). The test values to be added or
subtracted at any given stage are in fact identical except for the second
and third from least significant bit. Whether an add or a subtract is to
occur at any given stage is wholly a function of the sign of the partial
radicand ($R_i$) at that point.
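The add/subtract pattern in the table above can be sketched in the same integer form. As before this is only an illustration with hypothetical names: r is a 2n-bit integer radicand, and the subtract and add values ending in 101 and 011 become `(4 * root + 1)` and `(4 * root + 3)` once the newest root bit has been appended to `root`.

```python
def nonrestoring_sqrt(r, n):
    """Return the n-bit root of a 2n-bit radicand r without any restore steps."""
    root = 0
    partial = r - (1 << (2 * (n - 1)))            # step 1: subtract .01
    for i in range(2, n + 1):
        shift = 2 * (n - i)
        if partial >= 0:
            root = (root << 1) | 1                # previous bit was 1
            partial -= (4 * root + 1) << shift    # subtract ...f 1 01
        else:
            root = root << 1                      # previous bit was 0
            partial += (4 * root + 3) << shift    # add ...f 0 11
    # The sign of the last partial radicand decides the last root bit.
    return (root << 1) | (1 if partial >= 0 else 0)
```

The sign of each partial radicand is the only decision variable, which is what allows the procedure to be laid out as an unclocked array.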

Figure 22 depicts a conceptual flow chart of a nonrestoring realization
of the example posed earlier. Figure 23 illustrates a hardware formulation
based on adder/subtractor elements. The specific case shown is one of an
8-bit radicand. It should be clear that to generate N bits of root, only 2N
bits of radicand are necessary. This implies, in like fashion to the
division case, that the least significant bit of the radicand never enters
the calculation. In the case of the ASP, only 11 bits of root (plus a sign)
are necessary. Hence only the 22 bits of radicand after the binary point
are used.

Figure 24 is a practical hardware realization of the case shown in
Fig. 23, using the MC 10181 ALU unit. It can be shown, through detailed
manipulation, that the lengths of the adder/subtractor stages can be
abbreviated somewhat to conserve logic (specifically MC 10181s). In the
case of the ASP realization this savings is sizable. The ASP square root
function box is realizable with 22 of the MC 10181 units, approximately
half those necessary for a multiplier or a divider. The heart of the array
should operate somewhere in the vicinity of 100 nsec if look-ahead carry
blocks are used in the last four stages. There is, of course, the
additional fixed overhead delay of operand fetch, op code setup, and the
like.

Fig. 22. Conceptual flow chart of the extraction of four
bits of square root via the nonrestoring algorithm.

Fig. 23. Realization of 8-bit nonrestoring square root algorithm.

Fig. 24. Equivalent realization of 8-bit square root.

H.  Special  Functions 

The  ASP  has  been  provided  with  several  special  function  instructions 
to  facilitate  the  programming  of  scaling  operations,  floating  point  arithmetic, 
double-precision  multiplication,  and  bit-reversed  counting  (for  FFT  imple¬ 
mentation).  These  features  are  included  in  the  arithmetic/logic  function 
box  hardware  but  are  sufficiently  specialized  to  be  discussed  separately. 


1. Bit-Reversed Add

The bit-reversed add (BRA) requires that the lower byte of a specified
general register be bit-reversed (bit 1 and bit 12 interchanged, bit 2
and bit 11 interchanged, etc.) and added to the lower byte of a second
general register. The carry is to propagate from bit 12 to bit 1 (left to
right), and the sum replaces the contents of the lower byte of the second
general register. This task is most easily effected by bit reversing the
contents of the second general register, performing a normal add, and then
bit reversing the sum:

$$\underbrace{A + \mathrm{BRV}(B)}_{\text{carry left to right}} = \mathrm{BRV}\Big[\,\underbrace{\mathrm{BRV}(A) + B}_{\text{carry right to left}}\,\Big]$$

The operation requires some additional gating on the inputs and
output of the adder.
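The identity can be checked with a short Python sketch. The 12-bit width follows the text; the function names are hypothetical, and `reverse_carry_add` is included only as a bit-by-bit reference model of an adder whose carry runs from the most significant bit toward the least significant bit.

```python
WIDTH = 12
MASK = (1 << WIDTH) - 1


def brv(x):
    """Reverse the order of the 12 bits of x."""
    return int(format(x & MASK, "012b")[::-1], 2)


def reverse_carry_add(x, y):
    """Reference model: add with the carry rippling left to right (MSB toward LSB)."""
    carry, result = 0, 0
    for bit in range(WIDTH - 1, -1, -1):
        a, b = (x >> bit) & 1, (y >> bit) & 1
        result |= (a ^ b ^ carry) << bit
        carry = (a & b) | (carry & (a ^ b))
    return result


def bit_reversed_add(a, b):
    """BRA as described: reverse A, do a normal add with B, reverse the sum."""
    return brv((brv(a) + b) & MASK)
```

Repeatedly applying `bit_reversed_add(a, 1)` from a = 0 steps through 0, 0x800, 0x400, 0xC00, and so on, which is the bit-reversed counting sequence used for FFT addressing.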


2.  Zero  Inject  (ZINJ) 

This operation involves simply a right shift of the contents of the
lower byte of a selected general register. However, bit 12 does not
recirculate as is the case in normal, signed right shifts. A zero is
unconditionally shifted into bit 12. Implementation involves the normal
shifting hardware with a special inhibit on the bit 12 recirculate loop.
The shift is necessary to kill interference from the sign bit when
combining cross products in a programmed, double-precision multiply.
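The difference between the two shifts can be shown in a brief sketch, assuming a 12-bit byte with bit 12 as the sign bit; the function names are hypothetical.

```python
MASK = 0xFFF       # 12-bit byte
SIGN = 0x800       # bit 12, the sign bit


def signed_right_shift(x):
    """Normal signed right shift: the sign bit recirculates into bit 12."""
    return ((x & MASK) >> 1) | (x & SIGN)


def zero_inject(x):
    """ZINJ: the same right shift, but a zero always enters bit 12."""
    return (x & MASK) >> 1
```

The two functions agree whenever the sign bit is 0 and differ only in bit 12 otherwise.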

3.  Scale  Function  (SF) 

The scale function operation yields a positive, 12-bit number whose
magnitude is equal to one less than the number of leading 1s or 0s in the
contents of the upper byte of a selected general register. This entity
corresponds to the number of left shifts that would be required to
normalize the contents of the selected general register. If converting to
floating point, the negative of the scale function output corresponds to
the actual associated exponent of the shifted quantity. If operating in
floating point, the scale function output must be subtracted from the
exponent of the entity to be shifted to yield the net exponent of the
normalized result.
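A minimal sketch of the scale function follows. The name `scale_function` is hypothetical, and the cap at 10 is an assumption made here to match the maximum shift count quoted for the hardware.

```python
def scale_function(x):
    """One less than the run of leading bits equal to the sign bit of a 12-bit byte."""
    bits = format(x & 0xFFF, "012b")
    run = len(bits) - len(bits.lstrip(bits[0]))   # length of the leading run of 0s or 1s
    return min(run - 1, 10)                       # assumed cap: at most 10 shifts
```

For example, 000001000000 has five leading 0s, so four left shifts normalize it to 010000000000; 111110000000 likewise needs four shifts to reach 100000000000.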

The hardware necessary to perform the SF operation is shown in
Figs. 25 and 26. Figure 25 depicts a network which accepts 11 bits of input
and produces a series of 10 outputs. The number of the outputs in the
"true" state corresponds to the number of left shifts necessary to
normalize the number ($x = x_1 x_2 \cdots x_{11}$). The network of Fig. 26
is simply an interconnection of full adders (FA) to sum the number of 1s in
the output of the previous network. Only four outputs are produced since
the maximum number of shifts that can be required is $10 = 12_8$, which is
representable in four bits. Bits 5 - 12 of the output are always zero. The
hardware necessary to realize the SF operation involves only about a dozen
IC packages.

4.  Positive  Scale  Factor  (SFACP) 

The SFACP operation involves the transformation of a 4-bit number, N,
into a 12-bit number, $2^N$. A subsequent integer multiply of $2^N$ and the
contents of a selected general register byte will result in a net left
shift of the selected quantity N places. This permits use of the
multipliers in performing shift operations. The shifts might be involved as
part of normalizing or straight scaling operations. For example, three
steps are involved in normalizing:

(1) SF      find number of left shifts necessary

(2) SFACP   translate $N \to 2^N$

(3) MUL     do actual N place left shift with an integer multiply.

The 4-bit input is actually the low order four bits of the appropriate
12-bit general register byte.
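The three-step normalizing sequence can be sketched as below. This is a behavioural illustration only: the function names are hypothetical, `scale_function` repeats the assumed cap of 10 shifts, and the integer multiply is modelled by keeping the low 12 bits of the product.

```python
def scale_function(x):
    """SF: one less than the run of leading sign-equal bits (capped at 10 by assumption)."""
    bits = format(x & 0xFFF, "012b")
    run = len(bits) - len(bits.lstrip(bits[0]))
    return min(run - 1, 10)


def sfacp(n):
    """SFACP: map the low-order four bits N to the 12-bit number 2**N (N <= 10 assumed)."""
    n &= 0xF
    if n > 10:
        raise ValueError("2**N would not fit below the sign bit")
    return 1 << n


def normalize(x):
    """SF, then SFACP, then an integer multiply that performs the left shift."""
    n = scale_function(x)
    return (x * sfacp(n)) & 0xFFF, n
```

Here `normalize(0b000001000000)` returns the shifted byte 010000000000 together with the shift count 4.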
The hardware realization of the SFACP operation is depicted in Fig.
27. It accepts four input bits and yields 12 output bits, the most
significant of which is always a zero. Thus the multiplier can effect at
most a 10-place left shift at one time. A simple 4 to 10 decoder plus a few
gates are all that are necessary to realize the mapping network. The
hardware is thus negligible.

5.  Negative  Scale  Factor  (SFACN) 

The SFACN operation maps a 4-bit number, N, into a positive 12-bit
number, $2^{11-N}$, such that a subsequent fraction multiply with the
contents of a selected general register byte will effectively shift those
contents N places to the right. This permits the multiplier to be used as a
right shifter for scaling operations. The 4-bit input, which is actually
the low order 4 bits of an appropriate general register byte, may represent
any integer number in the range $0 < N < 11_8$. Since $2^{11}$ cannot be
represented without overflow into the sign bit, a right shift of zero places
is not permitted.

This  implies,  for  instance,  that  in  the  case  of  coefficient  alignment  for 
floating  point  operations,  the  possibility  of  equal  exponents  must  be  ex¬ 
plicitly  tested. 

Figure  28  illustrates  a  hardware  realization  for  the  SFACN  operation 
much  akin  to  that  for  SFACP.  It  can  be  realized  with  exactly  the  same 
hardware  except  that  the  outputs  (Y)  are  bit  reversed. 
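The right-shift effect of SFACN can be illustrated with a small sketch. The assumptions are stated plainly: a fraction multiply is modelled as an integer product of two signed 12-bit values followed by discarding the low 11 bits (truncation, not rounding), and the upper bound of 11 on N is only what keeps the scale factor a positive integer; the permitted hardware range may be narrower. The function names are hypothetical.

```python
def sfacn(n):
    """SFACN: map N to the positive 12-bit number 2**(11 - N)."""
    if not 1 <= n <= 11:
        raise ValueError("N = 0 would need 2**11, which overflows into the sign bit")
    return 1 << (11 - n)


def fraction_multiply(x, y):
    """Assumed model of a fraction multiply on signed 12-bit fractions (11 data bits)."""
    return (x * y) >> 11
```

With these definitions, `fraction_multiply(x, sfacn(n))` equals `x >> n` for every signed 12-bit x, which is the N-place right shift described above.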

I.  Scratch  and  Program  Memories 

Although the machine's architecture has been designed to accommodate
4096-word program and scratch memories, two identical 1024-word
memories, one for the program memory and one for the scratch memory,
are proposed to reduce the machine's cost. The word length for both
memories will be 24 bits organized as two 12-bit bytes, and the read and
write cycle times will be 30 nsec, where the time measurement begins when
the address signals at the input to the memory settle down.

The basic building block for the memories is the AMS 512-word,
6-bits-per-word, bipolar semiconductor card. Nine address signals are
decoded to choose one of 512 words. In addition, there are six input data
signals, two card select signals, and a write signal that are sent to the
card. Six data output signals are produced when the card is read. The six
output signals settle down 25 nsec after the nine address signals arrive at
the card if both the card select lines are low. If either select signal is
high, the memory outputs will be low; thus, the select signals act as card
enables and can be used to control the interconnection of memory cards to
build a large memory. New data are written into the card, after the six new
data signals and the address at which they are to be stored have settled
down, by generating a 10-nsec write signal. The total time for the write
operation is 25 nsec.

Fig. 27. Network to map $N \to 2^N$ for left shift.

Fig. 28. Network to map $N \to 2^{11-N}$ for right shift.

Eight AMS cards are interconnected (Fig. 29) to form a 1024-word,
24-bit memory. The cards are arranged in two groups of four cards, each
group containing 512 24-bit words. Ten address signals are brought to the
memory; the signals that represent the nine least significant bits of the
address are sent to all eight cards. The tenth signal is used to choose
which 512-word half of the memory will be read or written. Control is
achieved by sending the tenth signal itself to the select inputs of one row
of four cards and its complement to the select inputs of the other row of
four cards. The data outputs of the two rows of cards are connected
together as indicated. Once the input data and address signals have
settled down, a memory write is accomplished by enabling the write signals
ω_pμ and ω_pℓ. A 12-bit byte is written into the memory by forcing either
ω_pμ or ω_pℓ to be true, the choice depending upon the write instruction; a
24-bit word is written into the memory by enabling both ω_pμ and ω_pℓ
simultaneously.

Fig. 29.  Program memory/scratch memory.
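
The card arrangement described above can be modeled as a small data structure. The sketch below is illustrative only: the class name, the choice of which two cards in a row hold the upper byte, and the keyword arguments standing in for the two write signals are assumptions.

```python
class Memory1K:
    """1024-word, 24-bit memory built from eight 512-word, 6-bit cards."""

    def __init__(self):
        # rows[row][card][offset]: 2 rows x 4 cards x 512 six-bit values
        self.rows = [[[0] * 512 for _ in range(4)] for _ in range(2)]

    @staticmethod
    def _split(address):
        # The tenth address bit selects the row; the low nine bits go to every card.
        return (address >> 9) & 1, address & 0x1FF

    def read(self, address):
        row, offset = self._split(address)
        word = 0
        for card in range(4):
            word = (word << 6) | self.rows[row][card][offset]
        return word

    def write(self, address, word, upper=True, lower=True):
        """upper/lower stand in for the two byte write signals.
        Assumption: cards 0 and 1 of a row hold the upper 12-bit byte."""
        row, offset = self._split(address)
        for card in range(4):
            enabled = upper if card < 2 else lower
            if enabled:
                shift = 6 * (3 - card)
                self.rows[row][card][offset] = (word >> shift) & 0x3F
```

Writing with only `lower=True` changes the low 12 bits of the addressed word and leaves the upper byte intact, which corresponds to the single-byte write described in the text.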

The scratch memory, M_s, for all memory instructions except for
block transfers and those involving input and output from the machine,
obtains its address from the B bus, its input data from the A bus, and
sends its output data to the D bus. For input-output instructions, the
address and input data come from the I-O function box, and the memory
output is sent to the I-O function box. On block transfer instructions, the
address and input data come from the B and A buses, respectively, as they
do for most other instructions, but the output data go to the program
memory where they are stored, thus giving us the capability of writing
programs that modify themselves.

When a program is running, the program memory's address comes
from the processor's program address register, and the memory's input
data come from the output of scratch memory. The current processor
architecture has the output of the program memory going only to the
instruction register, but the addition of a path from the program memory
output to the input of the scratch memory is being considered. The path
will allow us to easily check the dynamic operation of the program memory.
The memory can be checked dynamically without this path, but only in an
awkward manner by forcing a test program to relocate itself in the memory.
The only scratch and program memory address and data paths not
mentioned are those associated with the console. These paths allow a user
or another computer to specify a data word and write the word into a
specific address in either memory, or to specify an address in either memory
and to examine the data word stored at that address.

A  scratch  memory  read  instruction  requires  approximately  100  nsec 
and  a  write  instruction  approximately  80  nsec.  The  reason  for  the  time 
difference  is  that  a  memory  write  instruction  has  one  less  data  transmission 
path  than  a  read  instruction.  When  the  memory  is  read,  an  address  is  sent 
from  the  general  registers  to  the  memory  and  data  are  returned  from  the 
memory  to  the  general  registers;  whereas,  when  the  memory  is  written  an 
address  and  input  data  are  sent  to  the  memory  from  the  general  registers 
but  no  data  are  returned  to  the  general  registers.  The  100-nsec  read  in¬ 
struction  time  breaks  down  in  the  following  way:  30  nsec  to  read  an  address 
from  a  general  register,  10  nsec  to  transmit  the  address  from  the  B  bus  to 
the  memory  assuming  that  the  memory  is  in  a  different  enclosure  from  the 
general  registers,  30  nsec  to  read  the  memory,  10  nsec  to  send  the  mem¬ 
ory  output  back  to  the  D  bus,  5  nsec  to  get  the  data  to  the  general  register 
via the D bus, and a 15-nsec safety factor. The time breakdown for a write
instruction is the same except that it does not include the 15 nsec
(10 nsec plus 5 nsec) needed to send data back to the general registers.

It takes approximately 60 nsec to read the program memory assuming
that the read time spans the interval beginning when a new instruction is
clocked into the instruction register and ending when the new instruction
signals arrive back at the input of the instruction register. The 60 nsec
breaks down in the following way: 3 nsec for the program memory address
register outputs to settle down, 10 nsec to send the address signals to the
memory assuming it is in another enclosure, 30 nsec to read the memory,
10 nsec to send the memory output back to the instruction register input,
and a 7-nsec safety factor.
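
The two timing budgets can be checked by simple addition. The sketch below copies the delays from the text; the dictionary keys and function names are illustrative, and the write figure assumes that the only terms dropped are the two return-path delays.

```python
# Delays in nanoseconds, copied from the breakdowns in the text.
SCRATCH_READ_NS = {
    "read address from general register": 30,
    "B bus to memory": 10,
    "memory read": 30,
    "memory output to D bus": 10,
    "D bus to general register": 5,
    "safety factor": 15,
}
PROGRAM_READ_NS = {
    "address register settle": 3,
    "address to memory": 10,
    "memory read": 30,
    "memory output to instruction register": 10,
    "safety factor": 7,
}
# Terms assumed absent from a scratch memory write (no data returned).
RETURN_PATH = ("memory output to D bus", "D bus to general register")

def total(budget):
    """Sum every delay in a budget."""
    return sum(budget.values())

def scratch_write_ns():
    """Scratch read budget minus the return-path delays."""
    return total(SCRATCH_READ_NS) - sum(SCRATCH_READ_NS[k] for k in RETURN_PATH)
```

With these numbers the scratch read comes to 100 nsec and the program memory read to 60 nsec, matching the text; the write works out to 85 nsec, in line with the "approximately 80 nsec" quoted.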

To build a 1024-word memory, approximately six auxiliary printed
circuit cards will be needed in addition to the eight AMS cards. The
auxiliary cards will be used for mounting line receivers, line drivers, and
address drivers. A complete memory will require approximately 600
integrated circuits and dissipate approximately 300 Watts.

J.  Input-Output Capability

The ASP has eight input-output channels. Two of these are full-duplex
DMA data channels capable of handling 24-bit data words (Fig. 30), each
equipped with two pairs of input and output control lines. The remaining six
"control" channels have no data handling capacity, but are equipped with the
same control facilities as the DMA channels.


Fig. 30.  Direct memory access channel (output channel and input channel).


The input side of each of the two DMA channels (channels 6 and 7)
consists of 24 lines of data input, two control lines, and two acknowledge lines.
Controls  are  named: 

IDR  input  data  request 
IDA  input  data  acknowledge 
ISR  input  status  request 
SRA  status  request  acknowledge. 

If an input is desired, the ASP raises either the IDR line or the ISR line, which
are presumably attached to the addressed peripheral. When the requested
data  are  on  the  input  lines,  the  peripheral  raises  the  appropriate  acknow¬ 
ledge  line  and  the  ASP  samples  the  data.  IDR  and  ISR  are  logically  equiva¬ 
lent  controls  that  provide  extra  flexibility:  the  addressed  peripheral  might 
place  different  types  of  data  on  the  lines  depending  on  which  control  line  is 
raised.  If  the  peripheral  sees  IDR,  it  will  place  a  piece  of  data  to  be  pro¬ 
cessed  on  the  lines.  If  it  sees  an  ISR,  it  will  place  a  data  word  on  the  lines 
relative  to  its  present  operating  condition. 
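
The input handshake can be sketched as a request followed by a matching acknowledge. This is a minimal model, not the hardware: the `Peripheral` class, its method names, and the use of a Python list as the data source are hypothetical; only the four line names come from the text.

```python
ACKNOWLEDGE_FOR = {"IDR": "IDA", "ISR": "SRA"}  # pairings named in the text

class Peripheral:
    """Hypothetical device attached to the input side of a DMA channel."""

    def __init__(self, data_words, status_word):
        self.data_words = list(data_words)
        self.status_word = status_word

    def respond(self, request_line):
        """Place a word on the 24 input lines and name the acknowledge line raised."""
        if request_line == "IDR":
            word = self.data_words.pop(0)   # next piece of data to be processed
        elif request_line == "ISR":
            word = self.status_word         # word describing the operating condition
        else:
            raise ValueError("unknown request line: " + request_line)
        return ACKNOWLEDGE_FOR[request_line], word & 0xFFFFFF

def asp_input(peripheral, request_line):
    """Raise a request line, wait for the matching acknowledge, sample the data."""
    acknowledge, word = peripheral.respond(request_line)
    if acknowledge != ACKNOWLEDGE_FOR[request_line]:
        raise RuntimeError("wrong acknowledge line raised")
    return word
```

The output side would follow the same pattern with ODR/ODA and EFR/EFA, with the data flowing from the ASP to the peripheral instead.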

Similarly,  the  output  side  of  a  DMA  channel  is  equipped  with  24  lines 
of  data  output,  two  control  lines,  and  two  acknowledge  lines.  Controls  are 
named: 

ODR  output data request
ODA  output data acknowledge
EFR  external function request
EFA  external function acknowledge.

If  the  ASP  desires  to  output  a  data  word,  it  raises  either  the  ODR  or  EFR 
lines.  When  the  addressed  peripheral  has  sampled  the  lines,  it  raises  the 
appropriate  acknowledge  line.  ODR  and  EFR  are  logically  equivalent  sig¬ 
nals.  EFR  might  signal  the  addressed  peripheral  to  interpret  the  incoming 
datum  as  a  control  word  intended  to  establish  an  operating  mode  rather  than 
as  a  simple  piece  of  information. 

These channels are termed DMA in the sense that, once a data buffer
has been initiated by an appropriate program instruction, i.e., control
parameters passed from M_r to the I-O handling logic, the channel
automatically accesses M_s as necessary, calculates M_s addresses
automatically, and signals the control processor when the entire buffer has
been transmitted. The "done" signal can engender an interrupt to the user
program, or can simply set a flag that can be explicitly tested by
programmed instructions at the option of the user.

Input and output buffers may be active on both sides of a given DMA
channel simultaneously. It is also possible for both DMA channels to be
active simultaneously with input, output, or both. In such instances,
conflicts may arise between channels for access to M_s. In fact, the
program running in the CPU may also desire use of M_s at any given point.

When conflicts arise between channels, M_s access will be apportioned such
that channel 6 is given priority over channel 7. When conflicts arise
between the input and output sides of the same channel, access will be
interleaved, input being served first. In conflicts with the CPU, the CPU will
be permitted to finish the instruction in progress. As soon as the CPU is
finished with M_s, the highest priority I-O commitment outstanding will be
serviced. Any subsequent CPU M_s accesses will be deferred until the queue
of pending I-O related accesses has been processed.
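
One way to read the arbitration rules for the DMA channels is sketched below. It assumes strict priority between channels and strict alternation between the two sides of a channel, and it ignores the CPU; the function name and the dictionary representation of outstanding transfers are illustrative.

```python
def service_order(words_pending):
    """Return the order in which scratch memory accesses would be granted.

    words_pending maps (channel, side) to a positive count of words still to
    be transferred, with side equal to "in" or "out".
    """
    pending = dict(words_pending)
    order = []
    last = None
    while pending:
        channel = min(ch for ch, _ in pending)        # channel 6 before channel 7
        sides = [side for ch, side in pending if ch == channel]
        if len(sides) == 2 and last == (channel, "in"):
            side = "out"                              # interleave after an input
        elif "in" in sides:
            side = "in"                               # input is served first
        else:
            side = "out"
        order.append((channel, side))
        pending[(channel, side)] -= 1
        if pending[(channel, side)] == 0:
            del pending[(channel, side)]
        last = (channel, side)
    return order
```

In the machine itself requests arrive at the pace of the peripherals, so channel 7 would normally be served in the gaps left by channel 6 rather than waiting for it to finish.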

The six control channels (0 through 5) (Fig. 31) are basically identical
in terms of control lines to the DMA channels. The essential difference is
the absence of data handling paths, which eliminates the need to access M_s.
This simplification greatly reduces the hardware necessary to realize a
given channel. Except for the lack of data, the control channels operate
the same way as the DMA channels, even to the point of interrupt generation,
if desired. These channels are designed for use in computer networks
where synchronization and simple control paths between processors are
of value, but full duplex data links are an unnecessary luxury.

Fig. 31.  Control channel (output channel and input channel).


Priority issues with regard to M_s access clearly do not arise with
the control channels. However, two channels programmed to interrupt when
they have completed their assigned tasks may try to signal the CPU at the
same time. This situation can occur with the DMA channels, too. In such
cases, the input side of a particular channel is given priority over the
output side; and channel priority is determined by the channel number, i.e.,
the lower the channel number, the higher its priority. This implies that the
control channels have priority over the DMA channels.
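
The interrupt priority rule amounts to a two-level sort. The sketch below assumes each simultaneous "done" signal is represented as a (channel, side) pair; that representation and the function name are illustrative.

```python
def interrupt_order(done_signals):
    """Order simultaneous 'done' signals by priority.

    Lower channel numbers come first; within one channel the input side
    precedes the output side.
    """
    return sorted(done_signals, key=lambda cs: (cs[0], 0 if cs[1] == "in" else 1))
```

For example, a control channel 3 output completion is placed ahead of a DMA channel 6 input completion, since channel number is compared before side.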

The 3-bit σ field in the instruction format to program the I-O system
(Fig. 32a) selects the channel to be actuated. The μ (or "monitor") bit, if set,
causes an interrupt to be issued when the selected channel has finished its
assignment. The interrupt will cause a program branch to a prescribed
subroutine. If μ is not a 1, a flag will be set when the channel is done, which
can be explicitly tested. The γ field specifies the nature of the operation to
be performed and is interpreted as follows:

0  input request
1  input status request
2  output request
3  external function request.


Fig. 32.  I-O format conventions.
(a) Instruction format:  OP CODE | A | B | γ | μ | σ
(b) A register:  BLOCK SIZE | INTRPT. SERVICE ROUTINE ENTRY PT.
(c) B register:  INCREMENT | STARTING M_s ADDRESS
The A and B fields specify general registers that are interpreted as shown
in Figs. 32b and c, respectively. These registers supply the necessary
control parameters to the I-O logic to effect the desired task. The upper
byte of A contains a number corresponding to the number of data words to
be transferred. The lower byte points to the M_p location to which program
control is to be transferred when the channel is done (if μ = 1). The upper
byte of B contains a signed number that defines the displacement between
locations successively accessed in M_s. For example, if equal to +1,
successive M_s locations will be accessed in order of increasing address. The
lower byte of B points to the M_s location to be accessed by the first transfer.
Implied by the foregoing is that only one side of one channel may be
activated by a given instruction. Also, in the case of control channels, the B
register is irrelevant since no data are actually transferred.

Fig. 33.  Block transfer format conventions.
(a) Instruction format:  OP CODE | A | B
(b) A register:  BLOCK SIZE | STARTING M_p ADDRESS
(c) B register:  INCREMENT | STARTING M_s ADDRESS
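
The field and register conventions can be illustrated with a short decoding sketch. Several points are assumptions rather than statements from the report: the fields are taken to be packed most significant first in the order Op Code, A, B, γ, μ, σ with the widths 6, 6, 6, 2, 1, 3 listed in the Appendix; the increment is read as 12-bit 2's complement; and addresses are wrapped to a 1024-word memory.

```python
def decode_io_instruction(word):
    """Split a 24-bit I-O instruction into its fields (assumed packing order)."""
    return {
        "op": (word >> 18) & 0o77,
        "a": (word >> 12) & 0o77,
        "b": (word >> 6) & 0o77,
        "gamma": (word >> 4) & 0b11,   # kind of request, 0 through 3
        "mu": (word >> 3) & 1,         # monitor (interrupt) bit
        "sigma": word & 0b111,         # channel number, 0 through 7
    }

def to_signed12(value):
    """Interpret the low 12 bits of value as a 2's complement number."""
    value &= 0xFFF
    return value - 0x1000 if value & 0x800 else value

def transfer_addresses(a_reg, b_reg, memory_size=1024):
    """List the scratch memory addresses a buffer would touch, in order."""
    block_size = (a_reg >> 12) & 0xFFF     # upper byte of A
    increment = to_signed12(b_reg >> 12)   # upper byte of B, signed
    start = b_reg & 0xFFF                  # lower byte of B
    return [(start + i * increment) % memory_size for i in range(block_size)]
```

With a block size of 4, an increment of -1, and a starting address of 10, the addresses touched would be 10, 9, 8, 7 under these assumptions.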

When an I-O precipitated interrupt occurs, the return point (P + 2) is
written into the lower byte of the A register after the service routine
entrance point has been read into P. This technique saves on the number of
general registers necessary to service the I-O, but requires that the lower
byte of A be restored in some fashion when the service routine terminates.
The RJP instruction was designed with this purpose in mind and is
documented in the Appendix along with the I-O jumps intended to explicitly
test for completed I-O transactions.

Figures 33a, b, and c show, respectively, the instruction format and the
two control parameter register formats for the block transfer instruction.
The block transfer is not an I-O operation in the strict sense, but rather
involves an internal two-way data transfer path between M_s and M_p. The
control parameter register formats are similar to those of genuine I-O
operations (the lower A byte designates the first M_p location to be
accessed), as is the necessary hardware.

No interrupts are involved with block transfers. Program execution
essentially halts while the transfer is in progress and resumes upon buffer
completion. If the block transfer modifies M_p, the first instruction
executed on resumption of normal operation is that to which control normally
would have been transferred prior to the M_p modification. If no M_p
modification occurred, no question arises. This instruction permits dynamic
alteration of the running program and facilitates maintenance of the program
memory (Appendix).
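
A block transfer can be pictured as a copy loop between two arrays. The sketch below is a guess at the behavior, not a description of the hardware: it assumes program memory addresses advance by one per word, that the increment applies only to scratch memory, that a direction bit of 0 means scratch to program, and that addresses wrap at the end of each memory.

```python
def blok(m_s, m_p, a_reg, b_reg, alpha):
    """Copy a block between scratch memory m_s and program memory m_p.

    a_reg: upper byte = block size, lower byte = starting M_p address.
    b_reg: upper byte = signed increment, lower byte = starting M_s address.
    alpha: 0 copies M_s to M_p (assumed), 1 copies M_p to M_s.
    """
    size = (a_reg >> 12) & 0xFFF
    p_addr = a_reg & 0xFFF
    step = (b_reg >> 12) & 0xFFF
    if step & 0x800:                 # 12-bit 2's complement increment
        step -= 0x1000
    s_addr = b_reg & 0xFFF
    for _ in range(size):
        if alpha == 0:
            m_p[p_addr % len(m_p)] = m_s[s_addr % len(m_s)]
        else:
            m_s[s_addr % len(m_s)] = m_p[p_addr % len(m_p)]
        p_addr += 1                  # assumed unit step through program memory
        s_addr += step
```

Running the loop with the direction bit set to 0 is the case that lets a program overwrite its own instructions, as the text notes.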


K.  User  Console 

The  ASP  is  equipped  with  a  console  to  monitor  and  control  the  operat¬ 
ing  status  of  the  machine,  interact  with  and  debug  user  programs,  and  to 
supplement  standard  engineering  maintenance.  The  ASP  console  design  is 
consistent  with  these  ground  rules: 

(1)  Preservation  of  Machine  Status 

Machine status interrogations will not alter the state of the computer:
examination of the contents of a selected M_s location will not alter the
contents of the M_s address register. The same is true of data entry. The only
permissible status change is that engendered by the deposition of inputted
data.

(2)  Complete  Examination  of  Machine  Status 

Every possibly useful register or group of registers will be viewable
and, where applicable, alterable.
(3)  Minimum Interconnections

Minimum  signal  paths  connect  the  console  and  the  ASP.  This  re¬ 
striction  simplifies  the  console,  enhances  cable  and  connector  reliability, 
reduces required connector pairs, and diminishes noise pickup that might
be  injected  into  the  ASP.  Noise  and  cable  complexity  are  important  if  the 
console  is  remote  from  the  computer.  If  so,  a  modem  set  for  data  trans¬ 
mission  might  be  desirable.  The  amount  of  multiplexing  required  will  have 
been  drastically  reduced  at  the  outset. 

(4)  Possibility  of  Automatic  Control 

To permit the user to interface with the ASP via a general-purpose
computer and associated I-O devices, a computer (possibly a minicomputer)
whose resident software can be written to simulate the presence of the
console might be installed. The monitor software could perform powerful
sequences of console operations at high speed:

(a)  The entire state of the ASP could be dumped on command,

(b)  An entire buffer could be dumped into M_p or M_s,

(c)  A particular program loop could be executed a prescribed
number of times.

The  possibilities  of  such  a  scheme  are  legion. 

Tentative  inputs  and  outputs  for  the  console  include: 

I.  Indicator Outputs

    A.  General
        1.  Instruction register                IR
        2.  Program memory                      M_p
        3.  Program counter                     P
        4.  Scratch memory                      M_s
        5.  Scratch memory address              M_s AR
        6.  General register memory             M_r
        7.  Bus address registers               A, B, D

    B.  Control
        1.  Machine stop
        2.  Machine run
        3.  Timing generator status

    C.  I-O
        1.  Direct memory access channels
            a.  Request status                  IR, OR, ISR, EFR
            b.  Data input M_s address
            c.  Data input increment
            d.  Data input
            e.  Data output M_s address
            f.  Data output increment
            g.  Data output
        2.  Inter-computer channels
            a.  Request status                  IR, OR, ISR, EFR

II.  Switch Inputs

    A.  General
        1.  Program memory                      toggle
        2.  Program counter                     toggle
        3.  Write program memory                push button
        4.  Read program memory                 push button
        5.  Scratch memory                      toggle
        6.  Scratch memory address              toggle
        7.  Write scratch memory                push button
        8.  Read scratch memory                 push button
        9.  General register address            toggle
        10. Read general register               push button

    B.  Control
        1.  Stop machine                        push button
        2.  Cycle machine                       push button
        3.  Step machine                        push button
        4.  Resume execution                    push button
        5.  Start execution at program
            counter switches                    push button
        6.  Stop when program counter
            equals switches                     toggle
        7.  Programmed stop switches            toggle
        8.  Programmed skip switches            toggle

The console consists of several light registers, switch registers, and
a  command  keyboard.  The  light  registers  permit  continuous  monitoring  of 
certain  machine  conditions  (P  register,  machine  run/stop,  timing  generator 
status)  and  provide  optional  interrogation  of  others.  Internal  conditions  may 
be  observed  on  command  via  a  general  light  register. 

The  several  toggle  switch  registers  are  necessary  to  permit  inputting 
two  or  more  pieces  of  information  simultaneously,  such  as  address  and  data 
to  load  one  of  the  memories.  Some  switches  must  also  be  available  for  con¬ 
tinuous  use:  the  program  skip  and  program  stop  switches. 

Commands are issued to the ASP via the keyboard. All command push
buttons are realized this way, as are the status interrogation options.
A function code is transmitted to the ASP when each key is depressed. The
code causes the console multiplexing logic internal to the ASP to bring the
data desired onto the console lines. Keys corresponding to command push
buttons dispatch appropriately timed pulses along with the function code to
the ASP. The pulses are properly steered inside the ASP to effect the
desired exercise.

L.  Construction 

The  dashed  line  in  Fig.  34  indicates  where  the  system  will  be  parti¬ 
tioned.  The  function  boxes  to  the  left  of  the  line  will  be  housed  in  a  single 
drawer  called  the  processor  drawer  and  those  to  the  right  of  the  line  will  be 
housed  in  a  second  drawer  called  the  memory  drawer. 

The circuits in the processor drawer will be mounted on either printed
circuit or wire-wrap boards which are 7 in. wide and 17 in. long and have
PC edge connector contacts for plugging in and out of back plane connectors.

The wire-wrap boards will have a ground plane, a voltage plane, and a
terminating voltage plane and will use short 2-wrap pins for better card
packing density. Two types of wire-wrap boards will be used: one will hold
a mix of 96 16- and 24-pin dual-in-line integrated circuits, and the other
will hold approximately 130 16-pin dual-in-line circuits.

The  complete  processor  drawer  will  contain  approximately  1400  inte¬ 
grated  circuits  mounted  on  16  boards  and  will  dissipate  approximately  300 
Watts. 
Fig. 34.  System partitioning.


The memory drawer will contain the scratch memory and the program
memory. There will be 16 AMS memory cards in the drawer plus another 12
auxiliary cards. Approximately 1200 integrated circuits will be mounted on
the 28 cards and they will require 600 Watts.

Figure 35 shows a tentative outline drawing of the computer which
reflects the desire to package the 16 processor cards in a 17 x 19 x 10 in.
enclosure, and the 28 memory cards in a memory drawer that is 17 x 19 x
12 in. Cool air will be forced through both drawers to ensure a maximum
temperature rise of no more than 15°C. This permits operation in a 45°C
ambient: the resulting 60°C internal temperature is 10°C below the 70°C
maximum operating temperature of the AMS card logic and 15°C below the
75°C maximum operating temperature of MECL 10K logic.

The  processor  and  memory  drawers  will  be  built  as  black  boxes  with 
few,  if  any,  external  controls.  A  portable  console  will  be  provided  to 
troubleshoot  programs  and  hardware. 

The  system  will  be  powered  by  a  bank  of  low  voltage,  high  current 
supplies. 
APPENDIX


ASP  PROGRAMMING  INSTRUCTIONS 


I.  Adds/Subtracts 
Format  I 


Op  Code 


B 


[Aj  +(B,J  ->(D  ] 

IV  +[!>,]-  [Dr3 
U£]  +  [B^  [D^ 
[Ajj]  +  [D£] 


❖  t  $ 


00,  1 
0,3 
0,  5 
0,7 
10,  11 
12,  13 
14,  15 

16,17  i^A^]  +[B[i])->[D(x]  ;  ^ ([ A^ ]  ±[B£})*[d(] 
20,21  [A]  +[B]-»  [D]  ,  DOUBLE  PRECISION 


[A  ]  +  [B  ]  -»  [D J  ;  [Aj  +  [Bj  -*  [Dj 
[A  ]  ±[B£]  -[D^]  ;[A£]  +[B|1]^[D1] 


D 


Format  II 


77 

77 

77 

77 


00,1 

02,3 

04,5 

06,7 


Op  Code 


Sub 

Op  Code 


B 


D 


\  ([  D>J  +  [B,])  ->  [Dj;  ^([Dj]  +  [Bj)  ->  [D^J 


2  ([  D^]  +  [  B^J)  -*  [  Djj,];  2([D£]  ±  [B£])  ->  [D£] 

2  ([  D|J  +  [  Bjg  ]  )  -»  [  Djj,];  2(1  Di  ]  +  [  B^] )  [D£] 

|[Bj|  -»  iDj;  |[Bj|  -»[d#} 


77 

77 

77 

77 

77 

77 


10,  11  Ld]  +  [  B^]  -»  [D] 
12,  13  2(  [D]  +[Bi])-»[D] 
14,15  [  D]  +  [  B^J  -»  [  D3 

16,  17  i([D]  +  [B^]  -»  [  D] 
20,21  2([D]  +  [B]  )  ->  [D3 
22,23  \([D)  +  [  B]  )->  [  D] 


>  DOUBLE  PRECISION 


*  All arithmetic is 2's complement.

†  Subscripts μ and ℓ refer to upper and lower bytes, respectively.
‡  [X] refers to "contents of X."
II.  Logical Operations

Format I:  Op Code (6 bits), A (6), B (6), D (6)

22  [A] ∧ [B] → [D]
23  [A] ∨ [B] → [D]
24  [A] ⊕ [B] → [D]

All three operate on 24 bits.

6 

6 

6 

6 

Format  II 

j  Op  Code 

Sub 

Op  Code 

B 

D 

77  24  [B]  [D]  ,  24  bit. 

77  25  [Bj  -  [DjJ  ;  [BpJ 

77  26  [BjJ  -  [D^hfBj] 

1-  [D,] 

-[DJ 

III.  Multiplication Operations

Format:  Op Code (6 bits), A (6), B (6), D (6)

25 

[V 

X  [Bj  -  [D^] 

26 

[A  ] 
V- 

X  [Bj]  -  [D,J 

27 

[AJ 

X  [Bj  -  [D^  j 

3  91 

[A,] 

xc^]-^] 

31 

[Ap] 

X  [Bp.]  -  [Dp.];  [Aj 

32 

CAjJ 

X  [Bt]~  [DpJ;  [A_g] 

33 

[A,] 

X  [Bp.]  -  [D,J;  [A^] 

34 

[A,] 

X  [Bj  -  [D]  1 

35 

[A,] 

X  [Bp.]  -  [D]  ' 

ID,] 

[°|) 


[D.] 


Fraction  multi¬ 
plies:  Bits  12  - 
23  of  product 
outputted. 


Integer  multiply: 
Bits  1  -  12  of 
product  outputted. 


Full  multiplies:  All  product 
bits  are  outputted. 


IV.  Division Operations

Format:  Op Code (6 bits), A (6), B (6), D (6)

36  [A]  4-  [Bp.]-[D)x] 

37  [A]  4-  [Bjg]  -*■  [D^] 

j.  Remainder  on 

6  6 

Dr 

V.  Square Root Function

Format:  Op Code (6 bits), Sub Op Code (6), B (6), D (6)

77  27  ([A])1/2  -  [DjJ 
VI.  Constants

Format:  Op Code (6 bits), Y (12), D (6)

40  Y → [D_μ]
41  Y → [D_ℓ]
42  Y + [D_μ] → [D_μ]
43  Y + [D_ℓ] → [D_ℓ]

Y is 11 bits plus sign.

VII.  Special Functions


6  6  6  6 


77  30  [D^]  +  BRV{[B^])-*-  [D^],  Bit-Reversed  Add.  Bit  reverse 

of  [B£]  added  to  [D^],  carry  propagates  left  to  right. 


77  31  (N  -  l)-*[Dg],  N  =  number  leading  Is  or  Os  in  [B^j  . 

Scale  Function  ,  for  normalization. 

77  32  2^-^-*  [D^],  Positive  Scale  Factor.  Left  shifting.  (SFACP). 

77  33  [Dg],  Negative  Scale  Factor.  Right  shifting. 

(SFACN). 

77  34  [B g ]  -*•  [D^],  Zero  shifted  into  bit  12.  For  double¬ 

precision  multiplies 

VIII.  Memory  Reference  Ops 

77  35  [MS(B^)]  ->  [D^] 

77  36  [Ms(Bx)]  -  [DjJ 

77  37  [Ms(B(jl)3  ->  [Djg] 

77  40  [MgtB^]  -  [D£] 

77  41  [D  ]  -  [Ms(B  )] 

77  42  [D  ]  -  [Mg(Bf)] 

77  43  [D£]  -  [Ms(B^)] 

77  44  [D£]  -  [Ms(Bp3 


(ZINJ). 


3 

|  12-bit  transfers. 


77  45  [M_s(B_μ)] → [D]
77  46  [M_s(B_ℓ)] → [D]
77  47  [D] → [M_s(B_μ)]
77  50  [D] → [M_s(B_ℓ)]

Ops 45 through 50 are 24-bit transfers.


IX.  Arithmetic Branches

Format:  Op Code (6 bits), α (1), β (1), Y (10), D (6)

Y:  Jump destination.
α:  If set, skip next instruction if jump occurs (SOJ).
β:  If set, test upper byte. Else test lower.
Return point:  P + 2 → R1

44  JPR:    Jump if [D_μ,ℓ] > 0
45  JNR:    Jump if [D_μ,ℓ] < 0
46  JZR:    Jump if [D_μ,ℓ] = 0
47  JUZR:   Jump if [D_μ,ℓ] ≠ 0
50  JPZR:   Jump if [D_μ,ℓ] ≥ 0
51  JNZR:   Jump if [D_μ,ℓ] ≤ 0
52  JZRD:   Jump if [D] = 0
53  JUZRD:  Jump if [D] ≠ 0

JZRD and JUZRD always test both bytes; β is not used.


X.  Unconditional Branches

Format:  Op Code (6 bits), α (1), β (1), Y (10), D (6)

Y:  Jump destination.
α:  If set, skip next instruction when jump occurs.

54  JPS:  Jump to Y and save P + 2 in [D_μ,ℓ] as specified by β.

55  XJP:  Jump to Y + [D_μ,ℓ] as specified by β. Save P + 2 in R1.

Skip next instruction.

57  RJP:  Jump to [D_ℓ] and write Y into [D_ℓ]. Skip next instruction.
          Used for closing interrupt service routines. Y is entry point.


Notes:  1)  There is an overflow flag for each D-bus byte.
        2)  All overflow jumps save P + 2 in R1_μ.
        3)  Flags may be set by the following single-precision ops:
            a)  Add or subtract
            b)  Magnitude function
            c)  Left shifts
            d)  Multiplies
            e)  Divisions (upper byte only)
            f)  Square root, if operand negative (upper byte).
        4)  Only the upper byte flag can be set by double-precision ops:
            a)  Adds or subtracts
            b)  Left shifts
        5)  Control transferred to Y.

77  51  CLOV:    Clear all overflow flags.
77  52  JOVL:    Jump on lower byte overflow and clear flag. Next
                 instruction always executed.
77  53  JOVLS:   Jump on lower byte overflow and clear flag. Next
                 instruction skipped if jump occurs.
77  54  JOVU:    Jump on upper byte overflow and clear flag. Next
                 instruction always executed.
77  55  JOVUS:   Jump on upper byte overflow and clear flag. Next
                 instruction skipped if jump occurs.
77  56  JOVUL:   Jump if either overflow set, do not clear flags. Next
                 instruction always executed.
77  57  JOVULS:  Jump if either overflow set and do not clear flags.
                 Next instruction skipped if jump occurs.


XII.  Input/Output

Format:  Op Code (6 bits), A (6), B (6), γ (2), μ (1), σ (3)

60  DMA:  Initiate automatic input/output sequence according
          to the following rules:

    a)  σ selects 1 of 8 channels. Channels 6 and 7 are
        direct memory access data channels. Channels
        0-5 are control channels and have no associated
        data paths.

    b)  γ selects I-O function desired:
        0 - Input request
        1 - Input status request
        2 - Output request
        3 - External function request

    c)  μ is the monitor interrupt. Main program is
        interrupted when I-O buffer is complete.

    d)  A, B select general registers which are interpreted
        as follows:

        A register:  Size of Data Block (12 bits) | Interrupt Service Return Entry Point (12 bits)
        B register:  Increment (12 bits) | M_s Starting Address (12 bits)

Format:  Op Code (6 bits), Y (12), γ (3), σ (3)

61  IOJP:  Jump to Y if the condition specified by γ is met
           by the channel selected by σ. Save P + 2 in R1.
           Skip next instruction if jump occurs.

    γ provides for the following tests:
        0 - Input inactive
        1 - Input status request inactive
        2 - Output inactive
        3 - External function request inactive
        4 - Input or output inactive
        5 - Input status request or external function request inactive
        6 - Input active
        7 - Output active


XIII.  Block Transfer

Format:  Op Code (6 bits), A (6), B (6), α (1), unused (5)

62  BLOK:  Transfer a list of words between M_s and M_p.
           Machine is effectively stopped. Program execution
           resumes after BLOK is complete with the first
           instruction subsequent to the BLOK prior to M_p
           modification. A and B select general registers
           which are interpreted as follows:

        A register:  Size of Data Block (12 bits) | Starting M_p Address (12 bits)
        B register:  Increment (12 bits) | Starting M_s Address (12 bits)

        α = 0:  M_s to M_p.
        α = 1:  M_p to M_s.
6  6  8  1  11  1

77  77  STPS (stop on switches):  Stop the

computer  if  the  combination  of  stop  switch  settings 
delineated  by  S,  ?  ,  .is  encountered.  If  all  S  bits 
are  set,  computer  halts  unconditionally.  Normally, 
only  one  S  bit  is  set. 

Note:  1)  52 of the 64 6-bit Op Codes have been assigned.

2)  53 of the 64 12-bit Op Codes have been assigned.

3)  These  are  only  tentative  assignments. 